Message from the Program Chair



I wish to welcome all the delegates to the 7th International Conference on High- 
Performance Computing to be held in Bangalore, India, 17-20 December 2000. 
This edition of the conference consists of ten sessions with contributed papers, 
organized as two parallel tracks. We also have six keynotes and an invited papers 
session by leading researchers, industry keynotes, two banquet speeches, a poster 
session, and eight tutorials. 

The technical program was put together by a distinguished program committee 
consisting of five program vice-chairs, nine special session organizers, and 36
program committee members. We received 127 submissions for the contributed 
sessions. Each paper was reviewed by three members of the committee and super- 
vised by either a program vice-chair or a special session organizer. After rigorous 
evaluation, 46 papers were accepted for presentation at the conference, of which 
32 were regular papers (acceptance rate of 25%) and 14 were short papers (ac- 
ceptance rate of 11%). The papers that will be presented at the conference are 
authored by researchers from 11 countries, which is an indication of the true 
international flavor of the conference. 

Half of the technical sessions are special sessions. All papers submitted to these 
special sessions were subject to the same review process as described above. The 
sessions are: Applied Parallel Processing organized by Partha Dasgupta and
Sethuraman Panchanathan (Arizona State University, USA), Cluster Computing
and Its Applications organized by Hee Yong Youn (Information and
Communication University, South Korea), High-Performance Middleware organized by
Shikharesh Majumdar (Carleton University, Canada) and Gabriel Kotsis
(University of Vienna, Austria), Large-Scale Data Mining organized by Gautam Das
(Microsoft Research, USA) and Mohammad Zaki (Rensselaer Polytechnic
Institute, USA), and Wireless and Mobile Communication Systems organized by
Azzedine Boukerche (University of North Texas, Denton, USA). The program
includes an invited papers session by leading computer architects, organized by 
Sriram Vajapeyam (Indian Institute of Science) and myself. In a plenary session 
titled “Future General-Purpose and Embedded Processors,” the speakers will 
share their visions for future processors. The speakers are: Trevor Mudge (Uni- 
versity of Michigan, Ann Arbor, USA), Bob Rau (Hewlett-Packard HP Labs, 
USA), Jim Smith (University of Wisconsin, Madison, USA) and Guri Sohi (Uni- 
versity of Wisconsin, Madison, USA). 

I wish to thank the program vice-chairs and special session organizers for their 
time and effort in the process of selecting the papers and preparing an excel- 
lent technical program. They are Nader Bagherzadeh (University of California at 
Irvine), Jack Dongarra (University of Tennessee at Knoxville and Oak Ridge
National Lab), David Padua (University of Illinois at Urbana-Champaign), Assaf
Schuster (Israel Institute of Technology, Technion), and Satish Tripathi
(University of California at Riverside). Viktor Prasanna, Sriram Vajapeyam and
Sajal Das provided excellent feedback about the technical program in their roles 
as general co-chairs and vice general chair, respectively. I would also like to 
thank Sartaj Sahni (University of Florida), Manavendra Misra (KBkids.com),
and Vipin Kumar (University of Minnesota) for performing their roles as poster 
session, tutorials session, and keynote address chairs, respectively. I would like 
to express my gratitude to Nalini Venkatasubramanian (University of California 
at Irvine) for compiling the proceedings of the conference. 

I also wish to acknowledge the tremendous support provided by Eduard Ayguade 
(Technical University of Catalonia). He managed all the work relating to
receiving papers through mail and web, arranging the electronic reviews, collecting the
reviews, organizing the reviews in summary tables for the program committee, 
and informing all authors of the decisions. 

Finally, I would like to thank Viktor Prasanna and Sriram Vajapeyam for invit- 
ing me to be part of HiPC 2000 as program chair. 



Mateo Valero 




Message from the General Co-Chairs 



It is our pleasure to welcome you to the Seventh International Conference on 
High Performance Computing. We hope you enjoy the meeting as well as the 
rich cultural heritage of Karnataka State and India. 

The meeting has grown from a small workshop held six years ago that addressed 
parallel processing. Over the years, the quality of submissions has improved and 
the topics of interest have been expanded in the general area of high performance 
computing. The growth in the participation and the continued improvement in 
quality are primarily due to the excellent response from researchers from around 
the world, enthusiastic volunteer effort, and support from IT industries world- 
wide. 

Mateo Valero agreed to be the Program Committee Chair despite his tight sched- 
ule. We are thankful to him for taking this responsibility and organizing an 
excellent technical program. His leadership helped attract excellent Program 
Committee members. It encouraged high quality submissions to the conference 
and invited papers by leading researchers for the session on future processors. 

As Vice General Chair, Sajal Das interfaced with the volunteers and offered his 
thoughtful inputs in resolving meeting-related issues. In addition, he was also in 
charge of the special sessions. It was a pleasure to work with him. 

Eduard Ayguade has redefined the term “volunteer” through his extraordinary 
efforts for the conference. Eduard was instrumental in setting up the Web-based 
paper submission and review process and the conference website, and also helped 
with publicity. We are indeed very grateful to Eduard. 

Vipin Kumar invited the keynote speakers and coordinated the keynotes. Sartaj
Sahni handled the poster/presentation session. Nalini Venkatasubramanian
interfaced with the authors and Springer-Verlag in bringing out these proceedings.
Manav Misra put together the tutorials. Venugopal handled publicity within In- 
dia and local arrangements. As in the past, Ajay Gupta did a fine job in handling 
international financial matters. C. P. Ravikumar administered scholarships for 
students from Indian academia. M. Amamiya and J. Torrellas handled publicity
in Asia and Europe, respectively. 

R. Govindarajan coordinated the industrial track and exhibits and also inter- 
faced with sponsors. Dinakar Sitaram, Novell India, provided invaluable inputs 
regarding conference planning. 
We would like to thank all of them for their time and efforts. Our special thanks
go to A. K. P. Nambiar for his continued efforts in handling financial matters as 
well as coordinating the activities within India. 

Major financial support for the meeting was provided by several leading IT com- 
panies. We would like to thank the following individuals for their support: 

N. R. Narayana Murthy, Chairman, Infosys; Avinash Agrawal, SUN Microsystems
(India); Konrad Lai, Intel Microprocessor Research Labs; Karthik Ramarao,
HP India; Amitabh Srivastava, Microsoft Research; and Uday Shukla, IBM
(India).

Continued sponsorship of the meeting by the IEEE Computer Society and ACM
is much appreciated. Finally, we would like to thank Henryk Chrostek and
Bhaskar Srinivasan for their assistance over the past year.



October 2000 



Sriram Vajapeyam 
Viktor K. Prasanna 
Charon Message-Passing Toolkit
for Scientific Computations 



Rob F. Van der Wijngaart 



Computer Sciences Corporation 
NASA Ames Research Center, Moffett Field, CA 94035, USA 

wijngaar@nas.nasa.gov



Abstract. Charon is a library, callable from C and Fortran, that aids 
the conversion of structured-grid legacy codes — such as those used in the 
numerical computation of fluid flows — into parallel, high-performance 
codes. Key are functions that define distributed arrays, that map between 
distributed and non-distributed arrays, and that allow easy specification 
of common communications on structured grids. The library is based 
on the widely accepted MPI message passing standard. We present an 
overview of the functionality of Charon, and some representative results. 



1 Introduction 

A sign of the maturing of the field of parallel computing is the emergence of 
facilities that shield the programmer from low-level constructs such as message 
passing (MPI, PVM) and shared memory parallelization directives (P-Threads, 
OpenMP, etc.), and from parallel programming languages (High Performance 
Fortran, Split C, Linda, etc.). Such facilities include: 1) (semi-)automatic tools 
for parallelization of legacy codes (e.g. CAPTools [8], ADAPT [5], CAPO [9],
SUIF compiler, SMS preprocessor [7], etc.), and 2) application libraries for
the construction of parallel programs from scratch (KeLP [3], OVERTURE,
PETSc, Global Arrays [10], etc.).

The Charon library described here offers an alternative to the above two ap- 
proaches, namely a mechanism for incremental conversion of legacy codes into 
high-performance, scalable message-passing programs. It does so without the 
need to resort up front to explicit parallel programming constructs. Charon is 
aimed at applications that involve structured discretization grids used for the 
solution of scientific computing problems. Specifically, it is designed to help
parallelize algorithms that are not naturally data parallel (i.e., that contain
complex data dependencies), which include almost all advanced flow solver methods
in use at NASA Ames Research Center. While Charon provides strictly a set 
of user-callable functions (C and Fortran), it can nonetheless be used to con- 
vert serial legacy codes into highly-tuned parallel applications. The crux of the 
library is that it enables the programmer to codify information about existing 
multi-dimensional arrays in legacy codes and map between these non-distributed 
arrays and newly defined, truly distributed arrays at runtime. This allows the 
programmer to keep most of the serial code unchanged and only use distributed
arrays in that part of the code of prime interest. This code section, which is par- 
allelized using more functions from the Charon library, is gradually expanded, 
until the entire code is converted. The major benefit of incremental paralleliza- 
tion is that it is easy to ascertain consistency with the serial code. In addition, 
the user keeps careful control over data transfer between processes, which is 
important on high-latency distributed-memory machines¹.

The usual steps that a programmer takes when parallelizing a code using 
Charon are as follows. First, define a distribution of the arrays in the program, 
based on a division of the grid(s) among all processors. Second, select a section of 
the code to be parallelized, and construct a so-called parallel bypass: map from 
the non-distributed (legacy code) array to the distributed array upon entry of 
the section, and back to the non-distributed array upon leaving it. Third, do the 
actual parallelization work for the section, using more Charon functions. 
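
The three steps above can be pictured with a small plain-Python simulation. It is only a sketch under stated assumptions: no MPI or Charon is involved, the processors are simulated by a list of local blocks, and the names `split_evenly`, `scatter`, `gather`, and `parallel_bypass` are hypothetical helpers rather than Charon functions.

```python
# Plain-Python simulation of a "parallel bypass"; no MPI or Charon is used.
# All names here are illustrative and are not part of the Charon API.

def split_evenly(n, parts):
    """Return (start, stop) index ranges dividing n points among parts owners."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def scatter(serial, ranges):
    """Map the non-distributed array to one local block per simulated processor."""
    return [serial[a:b] for a, b in ranges]

def gather(blocks):
    """Map the distributed blocks back to a single non-distributed array."""
    return [x for block in blocks for x in block]

def parallel_bypass(serial, parts, kernel):
    ranges = split_evenly(len(serial), parts)            # step 1: define the distribution
    blocks = scatter(serial, ranges)                     # step 2: enter the bypass
    blocks = [[kernel(x) for x in b] for b in blocks]    # step 3: work on local data only
    return gather(blocks)                                # step 2: leave the bypass

legacy = [float(i) for i in range(10)]
result = parallel_bypass(legacy, 3, lambda x: 2.0 * x)
```

Because the bypass returns data in the legacy layout, the result can be compared element by element with the output of the untouched serial code, which is the consistency check that incremental parallelization relies on.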

The remainder of this paper is structured as follows. In Section 2 we explain
the library functions used to define and manipulate distributed arrays
(distributions), including those that allow the mapping between non-distributed and
distributed arrays. In Section 3 we describe the functions that can be used
actually to parallelize an existing piece of code. Some examples of the use and
performance of Charon are presented in Section 4.



2 Distributed Arrays 

Parallelizing scientific computations using Charon is based on domain
decomposition. One or more multi-dimensional grids are defined, and arrays
(representing computational work) are associated with these grids. The grids are
divided into nonoverlapping pieces, which are assigned to the processors in the
computation. The associated arrays are thus distributed as well. This process
takes place in several steps, illustrated in Fig. 1 and described below.

First (Fig. 1a), the logically rectangular discretization grid of a certain
dimensionality and extent is defined, using CHN_Create_grid. This step establishes a
geometric framework for all arrays associated with the grid. It also attaches to
the grid an MPI communicator, which serves as the context and processor
subspace within which all subsequent Charon-orchestrated communications take 
place. Multiple coincident or non-coincident communicators may be used within 
one program, allowing the programmer to assign the same or different (sets of) 
processors to different grids in a multiple-grid computation. 

Second (Fig. 1b), tessellations of the domain (sections) are defined, based
on the grid variable. The associated library call is CHN_Create_section.
Sections contain a number of cutting planes (cuts) along each coordinate
direction. The grid is thus carved into a number of cells, each of which contains a
logically rectangular block of grid points. Whereas the programmer can
specify any number of cuts and cut locations, a single call to a high-level routine
often suffices to define all the cuts belonging to a particular domain
decomposition. For example, defining a section with just a single cell (i.e. a non-divided
grid with zero cuts) is accomplished with CHN_Set_solopartition_cuts. Using
CHN_Set_unipartition_cuts divides the grid evenly into as many cells as there
are processors in the communicator (nine in this case).

1 We will henceforth speak of processors, even if processes is the more accurate term.

a. Define logical grid
   CHN_Create_grid(grid,MPI_COMM_WORLD,2);
   CHN_Set_grid_size(grid,0,nx);
   CHN_Set_grid_size(grid,1,ny);

b. Define sections ssolo, suni (tessellations) based on same grid (Step a)
   CHN_Create_section(ssolo,grid);         CHN_Create_section(suni,grid);
   CHN_Set_solopartition_cuts(ssolo);      CHN_Set_unipartition_cuts(suni);
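
The difference between a section with zero cuts and an evenly divided one can be sketched in a few lines of Python. This is an illustration only: the helper names `even_cuts` and `cells_from_cuts` are hypothetical, the grid extents are made up, and the choice of three cells per direction for nine processors is an assumption; how Charon itself factors the processor count is not stated in the text.

```python
# Illustrative sketch of cut placement; not Charon's implementation.

def even_cuts(extent, ncells):
    """Interior cut positions that split `extent` grid points into `ncells` cells."""
    base, extra = divmod(extent, ncells)
    cuts, pos = [], 0
    for c in range(ncells - 1):
        pos += base + (1 if c < extra else 0)
        cuts.append(pos)
    return cuts

def cells_from_cuts(extent, cuts):
    """Half-open (start, stop) ranges of the cells delimited by the cuts."""
    bounds = [0] + list(cuts) + [extent]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

nx, ny = 12, 10                      # made-up grid extents
solo_x = even_cuts(nx, 1)            # single cell: zero cuts
uni_x = even_cuts(nx, 3)             # three cells along the first direction
uni_y = even_cuts(ny, 3)             # three cells along the second direction
cells = [(cx, cy)
         for cx in cells_from_cuts(nx, uni_x)
         for cy in cells_from_cuts(ny, uni_y)]
```

With these extents the sketch yields nine cells, one per processor of a nine-processor communicator, and cell sizes along a direction differ by at most one point.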

Third (Fig. 1c), cells are assigned to processors, resulting in a
decomposition. The associated function is CHN_Create_decomposition. The reason
why the creation of section and decomposition are separated is to provide
flexibility. For example, we may divide a grid into ten slices for execution on
a parallel computer, but assign all slices to the same processor for the purpose
of debugging on a serial machine. As with the creation of sections, we can
choose to assign each cell to a processor individually, or make a single call
to a high-level routine. For example, CHN_Set_unipartition_owners assigns each
cell in the unipartition section to a different processor. But regardless of the
number of processors in the communicator, CHN_Set_solopartition_owners assigns
all cells to the same processor.

Finally (Fig. 1d), arrays with one or more spatial dimensions (same as the
grid) are associated with a decomposition, resulting in distributions. The
associated function is CHN_Create_distribution. The arrays may represent scalar
quantities at each grid point, or higher-order tensors. In the example in Fig. 1
the tensor rank is 1, and the thusly defined vector has 5 components at each
grid point. A distribution has one of a subset of the regular MPI data types
(MPI_REAL in this case). Since Charon expressly supports stencil operations on
multi-dimensional grids, we also specify a number of ghost points. These form a
border of points (shaded area) around each cell, which acts as a cache for data
copied from adjacent cells. In this case the undivided distribution has zero
ghost points, whereas the unipartition distribution has two, which can support
higher-order difference stencils (e.g. a 13-point 3D star).

c. Define decompositions csolo, cuni (processor assignments)
   CHN_Create_decomposition(csolo,ssolo);  CHN_Create_decomposition(cuni,suni);
   CHN_Set_solopartition_owners(csolo,0);  CHN_Set_unipartition_owners(cuni);

d. Define distributions arr_, arrd_ (distributed arrays; exploded view)
   CHN_Create_distribution(arr_,csolo,     CHN_Create_distribution(arrd_,cuni,
                 MPI_REAL,arr,0,1,5);                    MPI_REAL,arrd,2,1,5);

   arr_: no ghost points                   arrd_: ghost points
   CHN_Redistribute(arrd_,arr_);
   CHN_Redistribute(arr_,arrd_);           (gather)

Fig. 1. Defining and mapping distributed arrays
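
The link between the number of ghost points and the stencil it can support is simple arithmetic, sketched below in Python. The helper names are hypothetical, and the cell shape used in the example is made up; only the ghost-point count of two and the five components per grid point come from the text.

```python
# Arithmetic sketch relating ghost points to stencil size and local storage.
# Helper names are illustrative; they are not Charon query functions.

def star_stencil_points(ndim, width):
    """Number of points in a star stencil reaching `width` points along each axis."""
    return 1 + 2 * ndim * width

def padded_shape(cell_shape, ghost):
    """Extent of a cell once a border of `ghost` points is added on every side."""
    return tuple(n + 2 * ghost for n in cell_shape)

def local_storage(cell_shape, ghost, ncomp):
    """Number of values stored for one cell with `ncomp` components per point."""
    size = ncomp
    for n in padded_shape(cell_shape, ghost):
        size *= n
    return size

points = star_stencil_points(ndim=3, width=2)        # 3D star reaching two points out
shape = padded_shape((4, 3), ghost=2)                # made-up 4 x 3 cell
storage = local_storage((4, 3), ghost=2, ncomp=5)    # 5-component vector per point
```

A star stencil of width two in three dimensions touches the centre point plus two points in each of six directions, which gives the 13 points mentioned above.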

The aspect of distributions that sets Charon apart most from other
parallelization libraries is the fact that the programmer also supplies the memory
occupied by the distributed array; Charon provides a structuring interpretation
of user space. In the example in Fig. 1 it is assumed that arr is the starting
address of the non-distributed array used in the legacy code, whereas arrd is
the starting address of a newly declared array that will hold data related to
the unipartition distribution. By mapping between arr and arrd, using
CHN_Redistribute (see also Fig. 1), we can dynamically switch from the serial
legacy code to truly distributed code, and back. All that is required is that
the programmer define distribution arr_ such that the memory layout of the
(legacy code) array arr coincides exactly with the Charon specified layout. This
act of reverse engineering is supported by functions that allow the programmer
to specify array padding and offsets, by a complete set of query functions, and
by Charon’s unambiguously defined memory layout model (see Section 3.3).

CHN_Redistribute can be used not only to construct parallel ‘bypasses’ 
of serial code of the kind demonstrated above, but also to map between any 
two compatible distributions (same grid, data type, and tensor rank). This is 
shown in Fig. 2, where two stripwise distributions, one aligned with the first
coordinate axis, and the other with the second, are mapped into each other, 
thereby establishing a dynamic transposition. This is useful when there are very 
strong but mutually incompatible data dependencies in different parts of the 
code (e.g. 2D FFT). By default, the unipartition decomposition divides all co- 
ordinate directions evenly, but by excluding certain directions from partitioning 
(CHN_Exclude_partition_direction) we can force a stripwise distribution. 
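
A dynamic transposition between two stripwise layouts can be simulated in plain Python as follows. This is a sketch under assumptions: processors are simulated by a list of dictionaries keyed by global indices, `distribute` and `redistribute` are hypothetical helpers, and the point-by-point transfer shown here says nothing about how Charon batches its messages.

```python
# Simulation of mapping between two stripwise distributions of a 2D array.
# No MPI is used; each "processor" is a dict {(i, j): value}.

def strip_ranges(n, parts):
    """Half-open index ranges dividing n points into `parts` strips."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def distribute(global_array, nprocs, split_axis):
    """Cut the grid along `split_axis` only and hand one strip to each processor."""
    nx, ny = len(global_array), len(global_array[0])
    ranges = strip_ranges((nx, ny)[split_axis], nprocs)
    return [{(i, j): global_array[i][j]
             for i in range(nx) for j in range(ny)
             if lo <= (i, j)[split_axis] < hi}
            for lo, hi in ranges]

def redistribute(src, nprocs, split_axis, shape):
    """Move every point from its current owner to its owner in the target layout."""
    ranges = strip_ranges(shape[split_axis], nprocs)
    dst = [dict() for _ in range(nprocs)]
    for block in src:                               # each source processor sends
        for (i, j), v in block.items():
            idx = (i, j)[split_axis]
            for p, (lo, hi) in enumerate(ranges):   # find the receiving processor
                if lo <= idx < hi:
                    dst[p][(i, j)] = v
    return dst

A = [[10 * i + j for j in range(6)] for i in range(4)]    # made-up 4 x 6 grid
ax = distribute(A, 2, split_axis=1)     # direction 0 excluded from partitioning
ay = redistribute(ax, 2, split_axis=0, shape=(4, 6))
```

After the redistribution every simulated processor holds complete grid lines in the second coordinate direction, so an operation with strong dependencies along that direction needs no further communication.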



3 Distributed and Parallel Execution Support 

While it is an advantage to be able to keep most of a legacy code unchanged and 
focus on a small part at a time for parallelization, it is often still nontrivial to 
arrive at good parallel code for complicated numerical algorithms. Charon offers 
support for this process at two levels. 

The first concerns a set of wrapping functions that allows us to keep the 
serial logic and structure of the legacy code unchanged, although the data is 
truly distributed. These functions incur a significant overhead, and are meant 
to be removed in the final version of the code. They provide a stepping stone in 
the parallelization, and may be skipped by the more intrepid programmer. 
Define stripwise tessellations sx, sy (based on same grid)
   CHN_Create_section(sx,grid);              CHN_Create_section(sy,grid);
   CHN_Exclude_partition_direction(sx,0);    CHN_Exclude_partition_direction(sy,1);
   CHN_Set_unipartition_cuts(sx);            CHN_Set_unipartition_cuts(sy);

Define decompositions cx, cy
   CHN_Create_decomposition(cx,sx);          CHN_Create_decomposition(cy,sy);
   CHN_Set_unipartition_owners(cx);          CHN_Set_unipartition_owners(cy);

Define distributed arrays ax_, ay_ and perform transpose
   CHN_Create_distribution(ax_,cx,MPI_REAL,ax,0,1,5);
   CHN_Create_distribution(ay_,cy,MPI_REAL,ay,0,1,5);
   CHN_Redistribute(ax_,ay_);

Fig. 2. Transposing distributed arrays



The second is a set of versatile, highly optimized bulk communication
functions that support the implementation of sophisticated data-parallel and (more
importantly) non-data-parallel numerical methods, such as pipelined algorithms.

3.1 Wrapping Functions 

In a serial program it is obvious what the statement a(i,j,k) = b(i+2,j-1,k)
means, provided a and b have been properly dimensioned, but when these
arrays are distributed across several processors the result is probably not what is
expected, and most likely wrong. This is due to one of the fundamental
complexities of message-passing, namely that the programmer is responsible for defining
explicitly and managing the data distribution. Charon can relieve this burden
by allowing us to write the above assignment as:
   call CHN_Assign(CHN_Address(a_,i,j,k),CHN_Value(b_,i+2,j-1,k))
with no regard for how the data is distributed (assuming that a_ and b_ are dis- 
tributions related to arrays a and b, respectively). The benefit of this wrapping 
is that the user need not worry (yet) about communications, which are implicitly 
invoked by Charon, as needed. 

The three functions introduced here have the following properties. CHN_Value
inspects the distribution b_, determines which (unique) processor owns the grid
point that holds the value, and broadcasts that value to all processors in the
communicator (‘owner serves’ rule). CHN_Address inspects the distribution a_
and determines which processor owns the grid point that holds the value. If the
calling processor is the owner, the actual address (an lvalue) is returned, and
NULL otherwise.² CHN_Assign stores the value of its second argument at the
address in the first argument if the address is not NULL. Consequently, only the
point owner of the left hand side of the assignment stores the value (‘owner
assigns’ rule). No distinction is made between values obtained through CHN_Value
and local values, or expressions containing combinations of each; all are rvalues.
Similarly, no distinction is made between local addresses and those obtained
through CHN_Address. Hence, the following assignments are all legitimate:

   call CHN_Assign(CHN_Address(a_,i,j,k),5.0)        !1
   call CHN_Assign(aux,CHN_Value(b_,i,j-1,k)+1.0)    !2
   aux = CHN_Value(b_,i,j-1,k)+1.0                   !3

It should be observed that assignments 2 and 3 are equivalent. An important
feature of wrapped code is that it is completely serialized. All processors execute
the same statements, and whenever an element of a distributed array occurs on
the right hand side of an assignment, it is broadcast. As a result, it is guaranteed
to have the correct serial logic of the legacy code.

2 In Fortran return values cannot be used as lvalues, but this problem is easily
circumvented, since the address is immediately passed to a C function.
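
The ‘owner serves’ and ‘owner assigns’ rules can be mimicked with a short Python simulation. It is a sketch only: the processors are simulated by looping over ranks, the broadcast is replaced by a direct read of the owner's copy, and the names `Distribution`, `value`, `address`, and `assign` are hypothetical stand-ins rather than the Charon interface.

```python
# Simulation of wrapped, fully serialized code on a block-distributed 1D array.
# Names are illustrative; a real run would use MPI ranks and broadcasts.

class Distribution:
    """A 1-D array of n points block-distributed over nprocs simulated ranks."""
    def __init__(self, n, nprocs, init=0.0):
        self.n, self.nprocs = n, nprocs
        self.block = -(-n // nprocs)                 # ceiling division
        self.local = [dict() for _ in range(nprocs)]
        for i in range(n):
            self.local[self.owner(i)][i] = init

    def owner(self, i):
        return i // self.block

def value(dist, i):
    """'Owner serves': every rank receives the owner's copy of point i."""
    return dist.local[dist.owner(i)][i]              # stands in for a broadcast

def address(dist, i, rank):
    """Return a handle for point i on its owner, and None on every other rank."""
    return (dist, i) if rank == dist.owner(i) else None

def assign(addr, val, rank):
    """'Owner assigns': only a rank holding a real address stores the value."""
    if addr is not None:
        dist, i = addr
        dist.local[rank][i] = val

n, nprocs = 7, 3
a = Distribution(n, nprocs)
b = Distribution(n, nprocs)
for i in range(n):                       # initialise b(i) = i*i through the wrappers
    for rank in range(nprocs):
        assign(address(b, i, rank), float(i * i), rank)
for i in range(n - 1):                   # wrapped form of a(i) = b(i+1) + 1.0
    for rank in range(nprocs):           # every rank executes the same statement
        assign(address(a, i, rank), value(b, i + 1) + 1.0, rank)
a_global = [value(a, i) for i in range(n)]
```

Every simulated rank runs every statement, yet each point is written exactly once, by its owner, which is why the wrapped code reproduces the serial result regardless of the distribution.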

3.2 Bulk Communications 

The performance of wrapped code can be improved by removing the need for 
the very fine-grained, implicitly invoked communications, and replacing them 
with explicitly invoked bulk communications. Within structured-grid applica- 
tions the need for non-local data is often limited to (spatially) nearest-neighbor 
communication; stencil operations can usually be carried out without any
communication, provided a border of ghost points (see Fig. 1) is filled with array
values from neighboring cells. This fill operation is provided by CHN_Copyfaces,
which lets the programmer specify exactly which ghost points to update. The 
function takes the following arguments: 

- the thickness of layer of ghost points to be copied; this can be at most the
  number of ghost points specified in the definition of the distribution,

- the components of the tensor to be copied. For example, the user may wish
  only to transfer the diagonal elements of a matrix,

- the coordinate direction in which the copying takes place,

- the sequence number of the cut (defined in the section) across which copying
  takes place,

- the rectangular subset of points within the cut to be copied.

In general, all processors within the grid’s MPI communicator execute
CHN_Copyfaces, but those that do not own points involved in the operation may
safely skip the call. A useful variation is CHN_Copyfaces_all, which fills all
ghost points of the distribution in all coordinate directions. It is the variation
most commonly encountered in other parallelization packages for structured-grid
applications, since it conveniently supports data parallel computations. But it is
not sufficient to implement, for example, the pipeline algorithm of Section 4.
The remaining two bulk communications provided by Charon are the
previously described CHN_Redistribute, and CHN_Gettile. The latter copies a subset
of a distributed array (which may be owned by several processors) into the local
memory of a specified processor (cf. Global Arrays’ ga_get [10]). This is useful
for applications that have non-nearest-neighbor remote data dependencies, such
as non-local boundary conditions for flow problems.
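
The ghost-point fill across a single cut, and the fill of all ghost points at once, can be sketched for a one-dimensional grid as follows. This is a simulation under assumptions: cells are plain Python lists padded with empty slots, the functions `make_cells`, `copy_faces`, and `copy_faces_all` are hypothetical and take far fewer arguments than the calls described above, and the grid size is assumed to be divisible by the number of cells.

```python
# One-dimensional simulation of filling ghost points from neighboring cells.
# Function names and signatures are illustrative, not the Charon interface.

def make_cells(values, ncells, ghost):
    """Split values into cells, each padded with `ghost` empty slots per side."""
    size = len(values) // ncells            # assumes an even division
    pad = [None] * ghost
    return [pad + values[c * size:(c + 1) * size] + pad for c in range(ncells)]

def copy_faces(cells, cut, ghost, thickness):
    """Exchange `thickness` layers across the cut between cells cut and cut+1."""
    assert 0 < thickness <= ghost
    left, right = cells[cut], cells[cut + 1]
    n_left = len(left) - 2 * ghost          # number of interior points on the left
    # The right cell's low-side ghost points receive the left cell's last points.
    right[ghost - thickness:ghost] = left[ghost + n_left - thickness:ghost + n_left]
    # The left cell's high-side ghost points receive the right cell's first points.
    left[ghost + n_left:ghost + n_left + thickness] = right[ghost:ghost + thickness]

def copy_faces_all(cells, ghost):
    """Fill every ghost point that has a neighboring cell, to full thickness."""
    for cut in range(len(cells) - 1):
        copy_faces(cells, cut, ghost, ghost)

cells = make_cells(list(range(8)), ncells=2, ghost=2)
copy_faces(cells, cut=0, ghost=2, thickness=1)
thin = [list(c) for c in cells]             # snapshot after a one-layer copy
copy_faces_all(cells, ghost=2)
# Central difference at global point 3 (local index 5 in the left cell):
diff = (cells[0][6] - cells[0][4]) / 2
```

Copying a single layer across one chosen cut is the fine-grained form that a pipelined sweep needs between layers of cells, whereas the all-at-once variant suits loops that are completely data parallel.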
3.3 Parallelizing Distributed Code Segments

Once the remote-data demand has been satisfied through the bulk copying of 
ghost point values, the programmer can instruct Charon to suppress broadcasts 
by declaring a section of the code local (see below). Within a local section not all
code should be executed anymore by all processors, since assignments to points 
not owned by the calling processor will usually require remote data not present 
on that calling processor; the code must be restructured to restrict the index sets 
of loops over (parts of) the grid. This is the actual process of parallelization, and 
it is left to the programmer. It is often conceptually simple for structured-grid 
codes, but the bookkeeping matters of changing all data structures at once have 
traditionally hampered such parallelization. The advantage of using Charon is 
that the restructuring focuses on small segments of the code at any one time, 
and that the starting point is code that already executes correctly on distributed 
data sets. The parallelization of a loop nest typically involves the following steps. 

1. Determine the order in which grid cells should be visited to resolve all data 
dependencies in the target parallel code. For example, during the x-sweep in 
the SP code described in Section 4.1 (Fig. 3), cells are visited layer by layer,
marching in the positive x-direction. In this step all processors still visit all
cells, and no explicit communications are required, thanks to the wrapper 
functions. This step is supported by query functions that return the number 
of cells in a particular coordinate direction, and also the starting and ending 
grid indices of the cells (useful for computing loop bounds). 

2. Fill ghost point data in advance. If the loop is completely data parallel, a
single call to CHN_Copyfaces or CHN_Copyfaces_all before entering the loop
is usually sufficient to fill ghost point values. If a non-trivial data
dependence exists, then multiple calls to CHN_Copyfaces are usually required. For
example, in the x-sweep in the SP code CHN_Copyfaces is called between
each layer of cells. At this stage all processors still execute all statements in
the loop nest, so that they can participate in broadcasts of data not
resident on the calling processor. However, whenever it is a ghost point value
that is required, it is served by the processor that owns it, rather than the
processor that owns the cell that has that point as an interior point. This
seeming ambiguity is resolved by placing calls to CHN_Begin_ghost_access
and CHN_End_ghost_access around the code that accesses ghost point data,
which specify the index of the cell whose ghost points should be used.

3. Suppress broadcasts. This is accomplished by using the bracketing con- 
struct CHN_Begin_local/CHN_End_local to enclose the code that accesses 
elements of distributed arrays. For example: 

   call CHN_Begin_local(MPI_COMM_WORLD)
   call CHN_Assign(CHN_Address(a_,i),CHN_Value(b_,i+1)-CHN_Value(b_,i-1))
   call CHN_End_local(MPI_COMM_WORLD)

At the same time, the programmer restricts accessing lvalues to points actu- 
ally owned by the calling processor. This is supported by the query functions 
CHN_Point_owner and CHN_Cell_owner, which return the MPI rank of the 
processor that owns the point and the grid cell, respectively. 
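
Restricting a loop to the points owned by the calling processor, once ghost points have been filled, can be sketched in Python as follows. The sketch rests on assumptions: a block distribution of a one-dimensional grid, one ghost point per side, and hypothetical helpers `block_range`, `point_owner`, and `local_difference`; the ghost values are read directly from a global list in place of a prior bulk copy.

```python
# Simulation of a loop whose index set is restricted to locally owned points.
# Helper names are illustrative; they are not Charon query functions.

def block_range(n, nprocs, rank):
    """Half-open range of global indices owned by `rank` (block distribution)."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    return lo, lo + base + (1 if rank < extra else 0)

def point_owner(n, nprocs, i):
    """Rank that owns global point i."""
    for rank in range(nprocs):
        lo, hi = block_range(n, nprocs, rank)
        if lo <= i < hi:
            return rank
    raise IndexError(i)

def local_difference(b_global, n, nprocs, rank):
    """Compute a(i) = b(i+1) - b(i-1) only for interior points owned by `rank`."""
    lo, hi = block_range(n, nprocs, rank)
    # Local copy of b with one ghost point per side, as if already filled in bulk.
    b_local = {i: b_global[i] for i in range(max(lo - 1, 0), min(hi + 1, n))}
    a_local = {}
    for i in range(max(lo, 1), min(hi, n - 1)):      # restricted index set
        a_local[i] = b_local[i + 1] - b_local[i - 1]
    return a_local

n, nprocs = 10, 3
b = [float(i * i) for i in range(n)]
pieces = [local_difference(b, n, nprocs, rank) for rank in range(nprocs)]
merged = {i: v for piece in pieces for i, v in piece.items()}
serial = {i: b[i + 1] - b[i - 1] for i in range(1, n - 1)}
```

No simulated processor reads a value outside its own points and ghost border, and the union of the local results matches the serial loop, which is the property to verify before the broadcasts are suppressed.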
Once the code segment is fully parallelized, the programmer can strip the
wrappers to obtain the final, high-performance code. Stripping effectively con- 
sists of translating global grid coordinates into local array indices, a chore that 
is again easily accomplished, due to Charon’s transparent memory layout model. 
By default, all subarrays of the distribution associated with individual cells of 
the grid are dimensioned identically, and these dimensions can be computed in 
advance, or obtained through query functions. For example, assume that the 
number of cells owned by each processor is nmax, the dimensions of the largest 
cell in the grid are nx × ny, and the number of ghost points is gp. Then the array
w related to the scalar distribution w_ can be dimensioned as follows: 



   dimension w(1-gp:nx+gp, 1-gp:ny+gp, nmax)

Assume further that the programmer has filled the arrays beg(2,nmax) and
end(2,nmax) with the beginning and ending point indices, respectively (using
Charon query functions), of the cells in the grid owned by the calling processor.
Then the following two loop nests are equivalent, provided n ≤ nmax.

   do j=beg(2,n),end(2,n)                         do j=1,end(2,n)-beg(2,n)+1
    do i=beg(1,n),end(1,n)                         do i=1,end(1,n)-beg(1,n)+1
     call CHN_Assign(CHN_Address(w_,i,j),5.0)       w(i,j,n) = 5.0
    end do                                         end do
   end do                                         end do



The above example illustrates the fact that Charon minimizes encapsulation; it 
is always possible to access data related to distributed arrays directly, without 
having to copy data or call access functions. This is a programming convenience, 
as well as a performance gain. Programs parallelized using Charon usually ulti- 
mately only contain library calls that create and query distributions, and that 
perform high-level communications. 
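
The translation from global grid coordinates to local array indices that stripping performs can be checked with a small Python sketch. The assumptions are stated plainly: the storage follows Fortran column-major order for an array dimensioned w(1-gp:nx+gp, 1-gp:ny+gp, nmax), the cell bounds in `beg` and `end` are made-up values, and `flat_index` and `to_local` are hypothetical helpers.

```python
# Index arithmetic for w(1-gp:nx+gp, 1-gp:ny+gp, nmax) in column-major order.
# Helper names and the example cell bounds are illustrative assumptions.

def flat_index(i, j, n, nx, ny, gp):
    """Column-major offset of w(i,j,n); i, j are local indices, n is 1-based."""
    ex, ey = nx + 2 * gp, ny + 2 * gp
    return (i - (1 - gp)) + ex * ((j - (1 - gp)) + ey * (n - 1))

def to_local(ig, jg, beg, n):
    """Translate global grid indices into local indices of cell n (all 1-based)."""
    return ig - beg[n][0] + 1, jg - beg[n][1] + 1

nx, ny, gp, nmax = 4, 3, 2, 2
beg = {1: (1, 1), 2: (5, 1)}            # made-up bounds of two owned cells
end = {1: (4, 3), 2: (8, 3)}
n = 2

# Loop over global indices, translating each to a local index.
via_global = set()
for jg in range(beg[n][1], end[n][1] + 1):
    for ig in range(beg[n][0], end[n][0] + 1):
        i, j = to_local(ig, jg, beg, n)
        via_global.add(flat_index(i, j, n, nx, ny, gp))

# Loop over local indices directly, as in the stripped code.
via_local = set()
for j in range(1, end[n][1] - beg[n][1] + 2):
    for i in range(1, end[n][0] - beg[n][0] + 2):
        via_local.add(flat_index(i, j, n, nx, ny, gp))
```

Both loops touch the same storage locations, all of them interior points of cell n, which is the sense in which the two loop nests in the text are equivalent.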

Finally, it should be noted that it is not necessary first to wrap legacy code 
to take advantage of Charon’s bulk communication facilities for the construc- 
tion of parallel bypasses. Wrappers and bulk communications are completely 
independent. 



4 Examples 

We present two examples of numerical problems, SP and LU, with complicated
data dependencies. Both are taken from the NAS Parallel Benchmarks (NPB)
[1], of which hand-coded MPI versions (NPB-MPI) and serial versions are freely
available. They have the form: $A u^{n+1} = b(u^n)$, where $u$ is the time-dependent
solution, $n$ is the number of the time step, and $b$ is a nonlinear 13-point-star
stencil operator. The difference is in the shape of $A$, the discretization matrix
that defines the ‘implicitness’ of the numerical scheme. For SP it is effectively:
$A_{SP} = L_z L_y L_x$, and for LU: $A_{LU} = L_+ L_-$.



4.1 SP Code 

$L_z$, $L_y$ and $L_x$ are fourth-order difference operators that determine data
dependencies in the z, y, and x directions, respectively. $A_{SP}$ is numerically inverted
in three corresponding phases, each involving the solution of a large number
of independent banded (penta-diagonal) matrix equations, three for each grid
line. Each equation is solved using Gaussian elimination, implemented as two
sweeps along the grid line, one in the positive direction (forward elimination) and
one in the negative direction (backsubstitution). The method chosen in NPB-MPI
is the multipartition (MP) decomposition strategy. It assigns to each processor
multiple cells such that, regardless of sweep direction, each processor has work
to do during each stage of the solution process. An example of a 9-processor 3D
MP is shown in Fig. 3. Details can be found in (IJ. While most packages we have
studied do not allow the definition of MP, it is easily specified in Charon:
call CHN_Create_section(multi_sec,grid)
call CHN_Set_multipartition_cuts(multi_sec)
call CHN_Create_decomposition(multi_cmp,multi_sec)
call CHN_Set_multipartition_owners(multi_cmp)
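
The defining property of a multipartition (every processor owns exactly one cell in every layer, whatever the sweep direction) can be checked with a few lines of Python. The diagonal mapping in `owner` below is an illustrative assumption chosen because it satisfies that property for 9 processors and 3 x 3 x 3 cells; it is not claimed to be the assignment that Charon or NPB-MPI actually uses.

```python
from itertools import product

P = 3  # cells per coordinate direction; P * P = 9 processors, P**3 = 27 cells


def owner(i, j, k, p=P):
    """Processor owning cell (i, j, k) under an illustrative diagonal mapping."""
    return ((i + k) % p) * p + ((j + k) % p)


def owners_in_layer(direction, index, p=P):
    """Sorted owners of all cells whose coordinate `direction` equals `index`."""
    cells = [c for c in product(range(p), repeat=3) if c[direction] == index]
    return sorted(owner(*c) for c in cells)


def cells_per_processor(p=P):
    """Number of cells assigned to each processor."""
    counts = [0] * (p * p)
    for cell in product(range(p), repeat=3):
        counts[owner(*cell)] += 1
    return counts
```

Because each layer contains every processor exactly once, all processors are busy at every stage of a sweep in the x, y, or z direction.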

The solution process in the x direction is as follows (Fig. 3). All processors
start the forward elimination on the left side of the grid. When the boundary of
the first layer of cells is reached, the elements of the penta-diagonal matrix
that need to be passed to the next layer of cells are copied in bulk using
CHN_Copyfaces. Then the next layer of cells is traversed, followed by another
copy operation, etc. The number of floating point operations and words
communicated in the Charon version of the code is exactly the same as in NPB-MPI.
The only difference is that Charon copies the values into ghost points, whereas
in NPB-MPI they are directly used to update the matrix system without going to
main memory. The latter is more efficient, but comes at the cost of a much
greater program complexity, since the communication must be fully integrated
with the computation.

Fig. 3. Nine-processor multipartition decomposition, and solution process for
SP code ($L_x$: forward elimination). Panels: each processor owns 3 cells (proc.
no. indicated on cell); Step 1: update layer of cells; Step 2: communicate
matrix elements (CHN_Copyfaces); Step 3: update layer of cells; Step 4:
communicate matrix elements (CHN_Copyfaces); Step 5: update layer of cells.

The results of running both Charon and NPB-MPI versions of SP for three
different grid sizes on an SGI Origin2000 (250 MHz MIPS R10000) are shown in
Fig. 4. Save for a deterioration at 81 processors for class B ($102^3$ grid) due to a
bad stride, the results indicate that the Charon version achieves approximately
70% of the performance of NPB-MPI, with roughly the same scalability
characteristics. While a 30% difference is significant, it should be noted that the Charon
version was derived from the serial code in three days, whereas NPB-MPI took
more than one month (both by the same author). Moreover, since Charon and
MPI calls can be freely mixed, it is always possible for the programmer who is
not satisfied with the performance of Charon communications to do (some of)
the message passing by hand.

If SP is run on a 10-CPU Sun Ultra Enterprise 4000 Server (250 MHz UltraSPARC II
processor), whose cache structure differs significantly from the R10000, the
difference between results for the Charon and NPB-MPI versions shrinks to a mere
3.5%; on this machine there is no gain in performance through hand coding.

Fig. 4. Performance of SP benchmark on 250 MHz SGI Origin2000




4.2 LU Code 

$L_-$ and $L_+$ are first-order direction-biased difference operators. They define
two sweeps over the entire grid. The structure of $L_-$ dictates that no point
$(i, j, k)$ can be updated before updating all points $(i_p, j_p, k_p)$ with smaller indices:
$\{(i_p, j_p, k_p) \mid i_p \le i,\ j_p \le j,\ k_p \le k,\ (i_p, j_p, k_p) \ne (i, j, k)\}$. This data dependency is
the same as for the Gauss-Seidel method with lexicographical point ordering. $L_+$
sweeps in the other direction. Unlike for SP, there is no concept of independent
grid lines for LU. The solution method chosen for NPB-MPI is to divide the grid
into pencils, one for each processor, and pipeline the solution process (Fig. 5):
call CHN_Create_section(pencil_sec,grid)
call CHN_Exclude_partition_direction(pencil_sec,2)
call CHN_Set_unipartition_cuts(pencil_sec)
call CHN_Create_decomposition(uni_cmp,uni_sec)
call CHN_Set_unipartition_owners(uni_cmp)



Each unit of computation is a single plane of points (tile) of the pencil. Once
a tile is updated, the values on its boundary are communicated to the pencil’s
Eastern and Northern (for $L_-$) neighbors. Subsequently, the next tile in the
pencil is updated. Of course, not all boundary points of the whole pencil should
be transferred after completion of each tile update, but only those of the ‘active’
tile. This is easily specified in CHN_Copyfaces.
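
The pipelining described above can be made concrete with a small scheduling sketch in Python. It assumes, for illustration only, that every tile update takes one time step, that communication is free, and that a tile needs the previous tile of its own pencil plus the same tile of its Western and Southern neighbors (the pencils that send to it during the $L_-$ sweep). The function name and the indexing are hypothetical and unrelated to the Charon API.

```python
def earliest_steps(npx, npy, ntiles):
    """Earliest time step at which each tile (px, py, t) can be updated."""
    step = {}
    for t in range(ntiles):
        for py in range(npy):
            for px in range(npx):
                deps = []
                if t > 0:
                    deps.append(step[(px, py, t - 1)])   # previous tile, same pencil
                if px > 0:
                    deps.append(step[(px - 1, py, t)])   # Western neighbor
                if py > 0:
                    deps.append(step[(px, py - 1, t)])   # Southern neighbor
                step[(px, py, t)] = 1 + max(deps) if deps else 0
    return step
```

Under these assumptions the earliest step of tile `t` in pencil `(px, py)` is `px + py + t`, so the pipeline fill time grows with the number of pencils per direction while the steady state keeps all processors busy.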

Fig. 5. Unipartition pencil decomposition (16 processors); Start of pipelined solution
process for LU code ($L_-$). Panels: each processor owns one pencil (proc. no.
indicated on pencil); Step 1: update tile; Step 2: communicate tile edges
(CHN_Copyfaces).




The results of both Charon and NPB-MPI versions of LU for three different grid
sizes on the SGI Origin are shown in Fig. 6. Now the performance of the Charon
code is almost the same as that of NPB-MPI. This is because both use ghost
points for transferring information between neighboring pencils. Again, on the
Sun E4000, the performances of the hand coded and Charon parallelized programs
are nearly indistinguishable.

Fig. 6. Performance of LU benchmark on 250 MHz SGI Origin2000



5 Discussion and Conclusions 

We have given a brief presentation of some of the major capabilities of Charon, a 
parallelization library for scientific computing problems. It is useful for those
applications that need high scalability, or that have complicated data dependencies
that are hard to resolve by analysis engines. Since some hand coding is required, 
it is more labor intensive than using a parallelizing compiler or code transfor- 
mation tool. Moreover, the programmer must have some knowledge about the 
application program structure in order to make effective use of Charon. When 
the programmer does decide to make the investment to use the library, the re- 
sults are close to the performance of hand-coded, highly tuned message passing 
implementations, at a fraction of the development cost. More information on the 
toolkit and a user guide are being made available by the author [El- 
Abstract. We present a framework for computing dynamic slices of
concurrent programs using a form of dependence graph as an intermediate
representation. We introduce the notion of a Dynamic Program Dependence
Graph (DPDG) to represent various intra- and interprocess
dependences of concurrent programs. We construct this graph through 
three hierarchical stages. Besides being intuitive, this approach also en- 
ables us to display slices at different levels of abstraction. We have consid- 
ered interprocess communication using both shared memory and message 
passing mechanisms. 



1 Introduction 

Program slicing is a technique for extracting only those statements from a pro- 
gram which may affect the value of a chosen set of variables at some point of 
interest in the program. Development of an efficient program slicing technique 
is a very important problem, since slicing finds applications in numerous areas 
such as program debugging, testing, maintenance, re-engineering, comprehen- 
sion, program integration and differencing. Excellent surveys on the applications 
of program slicing and existing slicing methods are available in ma . 

The slice of a program P is computed with respect to a slicing criterion 
< s,v >, where s is a statement in P and v is a variable in the statement. The 
backward slice of P with respect to the slicing criterion < s,v > includes only 
those statements of P which affect the value of v at the statement s. Several 
methods for computing slices of programs have been reported EE9- The semi- 
nal work of Weiser proposed computation of slices of a program using its control 
flow graph (CFG) representation 0. Ottenstein and Ottenstein were the first to 
define slicing as a graph reachability problem H . They used a Program Depen- 
dence Graph (PDG) for static slicing of single-procedure programs. The PDG 
of a program P is a directed graph whose vertices represent either assignment 
statements or control predicates, and edges represent either control dependence 
or flow dependence |J|- As originally introduced by Weiser, slicing (static slicing) 
considered all possible program executions. That is, static slices do not depend 
on the input data to a program. While debugging however, we typically deal 
with a particular incorrect execution and are interested in locating the cause 
of incorrectness in that execution. Therefore, we are interested in a slice that 
preserves the program behavior for a specific program input, rather than that 
for all possible inputs. This type of slicing is referred to as dynamic slicing. Korel
and Laski were the first to introduce the idea of dynamic slicing 0 . A dynamic 
slice contains all statements that actually affect the value of a variable at a pro- 
gram point for a particular execution of the program. Korel and Laski extended 
Weiser’s static slicing algorithm based on data-flow equations for the dynamic 
case |2|. Agrawal and Horgan were the first to propose a method to compute 
dynamic slices using PDG [6].

Present-day software systems are becoming larger and more complex, and usually
consist of concurrent processes. It is much more difficult to debug and understand
the behavior of such concurrent programs than that of sequential ones. Program
slicing techniques promise to come in handy at this point. In this paper, we present a
framework to compute slices of concurrent programs for specific program inputs 
by introducing the notion of a Dynamic Program Dependence Graph (DPDG).
To construct a DPDG, we proceed by first constructing a process graph and
a Static Program Dependence Graph (SPDG) at compile time. Trace files are
generated at run-time to record the information regarding the relevant events 
that occur during the execution of concurrent programs. Using this information 
stored in the trace files, the process graph is refined to realize a concurrency 
graph. The SPDG, the information stored in trace files, and the concurrency 
graph are then used to construct the DPDG. Once the DPDG is constructed, 
it is easy to compute slices of the concurrent program using a simple graph 
reachability algorithm. 
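
As a minimal illustration of slicing by graph reachability, the Python sketch below walks a dependence graph backwards from the slicing criterion. The dictionary layout (node mapped to the nodes it depends on) and the five-statement example are assumptions made for this illustration; they are not the DPDG data structure of the paper.

```python
def backward_slice(depends_on, criterion):
    """Return every node reachable from `criterion` along dependence edges."""
    visited = set()
    stack = [criterion]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stack.extend(depends_on.get(node, ()))
    return visited


# Hypothetical dependence graph: each node lists the nodes it depends on.
example = {
    "s1": [],        # a = input()
    "s2": [],        # b = input()
    "s3": ["s1"],    # c = a + 1
    "s4": ["s2"],    # d = b * 2
    "s5": ["s3"],    # print(c)
}
```

Here the slice with respect to `s5` contains `s5`, `s3`, and `s1`; statements `s2` and `s4` do not affect the printed value and are left out.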

2 Static Graph Representation 

Before we can compute slices of a concurrent program, we need to construct a 
general model of the program containing all information necessary to construct 
a slice. We do this through three hierarchical levels. In the first stage we graph- 
ically represent static aspects of concurrent programs which can be extracted 
from the program code. We will enhance this representation later to construct a 
DPDG. In our subsequent discussions, we will use primitive constructs for pro- 
cess creation, interprocess communication and synchronization which are similar 
to those available in the Unix environment |2j. The main motivation behind our 
choice of Unix-like primitives is that the syntax and semantics of these primi- 
tive constructs are intuitive, well-understood, easily extensible to other parallel 
programming models and also can be easily tested. The language constructs 
that we consider for message passing are msgsend and msgrecv. The syntax and 
semantics of these two constructs are as follows: 

• msgsend(msgqueue, msg): When a msgsend statement is executed, the 
message msg is stored in the message queue msgqueue. The msgsend statement 
is nonblocking, i.e. the sending process continues its execution after depositing 
the message in the message queue. 

• msgrecv(msgqueue, msg): When a msgrecv statement is executed, the variable
msg is assigned the value of the corresponding message from the message
queue msgqueue. The msgrecv statement is blocking, i.e. if msgqueue is found
to be empty, the receiving process waits until the corresponding sending process
deposits the message.

We have considered nonblocking send and blocking receive semantics of inter- 
process communication because these have traditionally been used for concurrent 
programming applications. In this model, no assumptions are made regarding 
the order in which messages arrive in a message queue from the msgsend
statements belonging to different processes except that messages sent by one process
to a message queue are stored in the same order in which they were sent by the 
process. 
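
The send and receive semantics described above can be mimicked in Python with the standard `queue.Queue` class, using threads as stand-ins for processes. This is only a sketch of the assumed model: `msgsend` and `msgrecv` are reused here as hypothetical Python function names, and `msgrecv` returns the message instead of assigning it to a variable passed as an argument.

```python
import queue
import threading


def msgsend(msgqueue, msg):
    """Nonblocking send: deposit the message and continue immediately."""
    msgqueue.put(msg)  # an unbounded Queue never blocks on put


def msgrecv(msgqueue):
    """Blocking receive: wait until a message is available, then return it."""
    return msgqueue.get()


def demo():
    q = queue.Queue()
    received = []

    def receiver():
        for _ in range(3):
            received.append(msgrecv(q))  # blocks while the queue is empty

    worker = threading.Thread(target=receiver)
    worker.start()
    for message in ("a", "b", "c"):
        msgsend(q, message)
    worker.join()
    return received
```

With a single sender the messages are received in the order in which they were sent, which matches the only ordering guarantee the model makes.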

A fork() call creates a new process called child which is an exact copy of the 
parent. It returns a nonzero value (process ID of the child process) to the parent 
process and zero to the child process fZj . Both the child and the parent have sep- 
arate copies of all variables. However, shared data segments acquired by using 
the shmget() and shmat() function calls are shared by the concerned processes. 
Parent and child processes execute concurrently. A wait() call can be used by
the parent process to wait for the termination of the child process. In this case,
the parent process would not proceed until the child terminates. Semaphores are
synchronization primitives which can be used to control access to shared
variables. In the Unix environment, semaphores are realized through the semget()
call. The value of a semaphore can be set by the semctl() call. The increment and
decrement operations on semaphores are carried out by the semop() call t ZJ.
However, for simplicity of notation, in the rest of the paper we shall use P(sem) 
and V (sem) as the semaphore decrement and increment operations respectively. 



2.1 Process Graph 

A process graph captures the basic process structure of a concurrent program. 
In this graph, we represent the process creation, termination, and joining of 
processes. More formally, a process graph is a 5-tuple (En, T, N, E, C) where
En is a special entry node denoting the start of the program, T is the set of terminal
nodes, N is the set of non-terminal, non-entry nodes, E is the set of edges, and C
is a function that assigns statements to edges in the process graph. A node from
N represents any one of the following types of statements: a fork() statement
(denoted as F), a wait() statement (denoted as W) or a loop predicate whose
loop body contains a fork() statement (denoted as L). Each individual terminal 
node from T will be denoted as t. Edges may be of three types: process edge, loop 
edge, and join edge. A process edge represents the sequence of statements starting 
from the statement represented by the source node of the edge till the statement 
represented by the sink node of the edge. Loop and join edges are dummy edges 
and do not represent any statement of the program but are included to represent 
the control flow. The source nodes of both of these edge types are terminal
nodes. The sink node of a loop edge is of L-type and that of a join edge is of
W-type. Direction of an edge represents direction of control flow. We will use 
solid edges to denote process edges, dashed edges to denote loop edges, and 






D. Goswami and R. Mall 




Fig. 1. (a) An Example Concurrent Program (b) Its Process Graph. Legend: solid
arrow: process edge; dashed arrow: loop edge; dotted arrow: join edge; SQ:
statement sequence.



dotted edges to represent join edges. Figure 1(b) shows the process graph of the
example concurrent program given in figure 1(a). In the example of figure 1, the
labels of the statements indicate the type of nodes represented by the concerned
statements.
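
One possible in-memory form of the 5-tuple (En, T, N, E, C) is sketched below in Python. The class name, field names, and the small example program (one fork, one wait) are assumptions made for illustration; only the constraints checked in `is_well_formed` are taken from the definition above.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessGraph:
    entry: str                                        # En
    terminals: set                                    # T
    nodes: dict                                       # N: name -> "F", "W" or "L"
    edges: list = field(default_factory=list)         # E: (source, sink, kind)
    statements: dict = field(default_factory=dict)    # C: edge -> statement list

    def add_edge(self, source, sink, kind, stmts=()):
        edge = (source, sink, kind)
        self.edges.append(edge)
        self.statements[edge] = list(stmts)

    def is_well_formed(self):
        """Loop/join edges must start at a terminal node, end at an L/W node,
        and carry no statements."""
        required_sink = {"loop": "L", "join": "W"}
        for source, sink, kind in self.edges:
            if kind not in required_sink:
                continue
            if source not in self.terminals:
                return False
            if self.nodes.get(sink) != required_sink[kind]:
                return False
            if self.statements[(source, sink, kind)]:
                return False
        return True


def example_graph():
    """SQ1; fork(); child runs SQ2; parent runs SQ3; wait(); SQ4."""
    g = ProcessGraph(entry="En", terminals={"t1", "t2"},
                     nodes={"F": "F", "W": "W"})
    g.add_edge("En", "F", "process", ["SQ1"])
    g.add_edge("F", "t1", "process", ["SQ2"])   # child process
    g.add_edge("F", "W", "process", ["SQ3"])    # parent up to wait()
    g.add_edge("t1", "W", "join")               # child termination joins the wait
    g.add_edge("W", "t2", "process", ["SQ4"])
    return g
```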

2.2 Static Program Dependence Graph 

A static program dependence graph (SPDG) represents the program
dependences which can be determined statically. The SPDG is constructed by
extending the process graph discussed in the previous section. It can easily be
observed that control dependences among statements are always fixed and do 
not vary with the choice of input values and hence can be determined statically 
at compile time. A major part of data dependence edges can also be determined 
statically, excepting the statements appearing under the scope of selection and 
loop constructs which have to be handled at run-time. To construct the SPDG 
of a program, we first construct a dummy node S for each edge whose source 
node is a fork node F to represent the beginning of the statement sequence
represented by that edge. Control dependence edges from node F to each node S
are then constructed, and the statements which were earlier assigned to the edges
beginning with the fork node F are now assigned to the corresponding edges
beginning with the dummy nodes S.

Let W be a node in the process graph representing a wait call. Let m be the 
first node of type F or L or t found by traversing the process graph from node 
W along the directed edges. All the statements which are represented by the
edge from W to m are said to be controlled by the node W.

We now construct a control dependence subgraph for each statement se- 
quence representing an edge (or a sequence of edges) beginning with node En 
or S and ending with a node F or a t in the process graph. Informally, in each 
of these control dependence subgraphs, a control dependence edge from a node
x to a node y exists iff any one of the following holds:

• x is the entry node (En) or a process start node ( S ) and y is a statement 
which is not nested within any loops or conditional statements and is also not 
controlled by a wait call. 

• x is a predicate and y is a statement which is immediately nested within 
this predicate. 

• x is a wait call and y is a statement which is controlled by this wait call 
and is not immediately nested within any loop or conditional statement. 

After construction of control dependence subgraph for each edge of the pro- 
cess graph, all data dependence edges between the nodes are then constructed. 
Data dependence edges can be classified into two types: deterministic and
potential. A data dependence edge which can be determined statically is said to
be deterministic and the one which might or might not exist depending on the 
exact execution path taken at run-time is said to be potential. We construct all 
potential data dependence edges at compile time but whether a potential data 
dependence edge is actually taken can be determined only based on run-time 
information. To mark an edge as a potential edge in the graph, we store this 
information at the sink node of the edge. 

3 Dynamic Program Dependence Graph 

In this section, we discuss how dynamic information can be represented in an 
enhanced graph when a concurrent program is executed. We call this enhanced 
graph with dynamic information, a Dynamic Program Dependence Graph 
(DPDG). A DPDG of a concurrent program is a directed graph. The nodes 
of DPDG represent the individual statements of the program. There is a special 
entry node and one or more terminal nodes. Further for every process there is a 
special start node as already discussed in the context of SPDG definition. The 
edges among the nodes may be of data/control dependence type or synchroniza- 
tion/communication dependence type. The dynamic data dependence edges are 
constructed by analyzing the run-time information available. Data dependences 
due to shared data accesses, synchronization due to semaphore operations, and 
communication due to message passing are considered and corresponding edge 
types are added in the graph for those instructions which are actually executed. 
After construction of the DPDG, we apply a reachability criterion to compute
dynamic slices of the program. To construct the DPDG, it is necessary to first
construct a concurrency graph to determine the concurrent components in the
graph.

3.1 Concurrency Graph

A concurrency graph is a refinement of a process graph and is built using the 
program’s run-time information. A concurrency graph is used to perform concur- 
rency analysis which is necessary to resolve shared dependences existing across 
process boundaries. A concurrency graph retains all the nodes and edges of a 
process graph. Besides these, it contains two new types of edges: synchronization 
edges and communication edges. It also contains a new type of node called a syn- 
chronization node. The purpose and construction procedure for these new edges 
and nodes are explained in the following subsections. The concurrency graph 
also represents the processes which get dynamically created during run-time. 



3.1.1 Communication Edge 

The execution of a msgrecv statement depends on the execution of a corresponding
msgsend statement. This dependence is referred to as communication
dependence. A msgrecv statement $s_r$ is communication dependent on a msgsend
statement $s_s$ if communication occurs between $s_r$ and $s_s$ during execution.
Communication dependence cannot be determined statically because messages sent
from different processes to a message queue may arrive in any order. So, com- 
munication dependences can only be determined using run-time information. 
The information needed to construct a communication edge from a send node 
x in a process P to a receive node y in a process Q is the pair (P, x) stored at 
node y. To be able to construct the communication edges, the sending process 
P needs to append the pair ( sending process ID, sending statement no.) to the 
sent message. After receiving the message, the receiving process Q needs to store 
this information in a trace file. From this information, the communication edges 
can immediately be established. Each message passing statement (both send and
receive) which gets executed is represented in the concurrency graph as a node,
and this node is referred to as a synchronization node, Sn.
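
The construction of communication edges from trace files can be sketched as follows in Python. The record layout `("recv", receive_node, (sender_pid, send_node))` and the per-process dictionary of trace records are hypothetical; the paper specifies only that the receiver stores the pair attached by the sender.

```python
def communication_edges(trace_files):
    """Build communication edges from per-process trace records.

    trace_files maps a receiving process ID to its list of trace records.
    A receive record has the assumed form
        ("recv", receive_node, (sender_pid, send_node)).
    Each edge is returned as ((sender_pid, send_node), (receiver_pid, receive_node)).
    """
    edges = []
    for receiver_pid, records in trace_files.items():
        for record in records:
            if record[0] != "recv":
                continue  # only receive events carry the sender pair
            _, receive_node, sender = record
            edges.append((sender, (receiver_pid, receive_node)))
    return edges
```

For example, a record `("recv", 12, ("P", 7))` in the trace of process `Q` yields one edge from node 7 of `P` to node 12 of `Q`.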



3.1.2 Synchronization Edge 

Semaphore operations (P and V) are used for controlling accesses to shared 
variables by either acquiring resources (through a P operation) or releasing re- 
sources (through a V operation) . We construct a synchronization edge from the 
node representing each V operation to the node representing the corresponding 
P operation on the same semaphore. A synchronization edge depicts which P 
operation depends on which V operation. We define a source node for the V oper- 
ation and a sink node for the corresponding P operation. Identification of a pair 
of related semaphore operations can be done by matching the nth V operation
to the (n + i)th P operation on the same semaphore variable, i being the initial
value of the semaphore variable. To record the information required to determine
this semaphore pairing, each semaphore operation records (for the process
it belongs to) the number of operations on the given semaphore which have
already occurred. From this recorded information, the semaphore operations can
easily be paired and synchronization edges can then be constructed from these
pairings. Each semaphore operation which gets executed is represented as a node
in the concurrency graph, and will be referred to as a synchronization node, Sn.
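
The pairing rule (the nth V operation matches the (n + i)th P operation) is short enough to state as code. The Python sketch below assumes that the V and P operations on one semaphore have already been put in the order given by their recorded counters; the function and argument names are hypothetical.

```python
def pair_semaphore_ops(v_ops, p_ops, initial_value):
    """Return synchronization edges (V node, P node) for one semaphore.

    v_ops and p_ops list the executed V and P nodes in occurrence order.
    The n-th V (1-based) is matched with the (n + initial_value)-th P.
    """
    edges = []
    for n, v_node in enumerate(v_ops, start=1):
        target = n + initial_value          # 1-based index of the matching P
        if target <= len(p_ops):            # that P may not have executed
            edges.append((v_node, p_ops[target - 1]))
    return edges
```

With an initial value of 1, the first P succeeds without waiting for any V, so the first V is paired with the second P.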



3.1.3 Handling Dynamic Creation of Processes 

Parallel and distributed programs may dynamically create processes during run- 
time. If a fork call is nested within a loop or conditional, it is difficult to determine 
during compile time, the exact number of fork calls that would be made. This 
information can only be obtained at run-time. For example, if the loop predicate
L of the process graph shown in figure 2(a) is executed zero, one, or two
times, the number of processes created varies and the corresponding concurrency
graphs for these situations are depicted in figures 2(b), 2(c), and 2(d). The
concurrency graph incorporates all these dynamically created processes from the 
information stored in the trace files. 




Fig. 2. (a) A Sample Process Graph. The Corresponding Concurrency Graphs which 
Result After Execution of the Loop for (b) Zero, (c) One, and (d) Two Times (assuming 
that there exists no interprocess communication among processes) 



3.2 Code Instrumentation for Recording Events 

For construction of the DPDG, we have to record the necessary information
related to process creation, interprocess communication, and synchronization
aspects of concurrent programs at run-time. When a program is executed, a trace
file for each process is generated to record the event history for that process. We
have identified the following types of events to be recorded in the trace file along
with all relevant information. The source code would have to be instrumented
to record these events.

• Process creation: Whenever a process gets created it has to be recorded. 
The information recorded is the process ID and the parent process’s ID. A process
may create many child processes by executing fork calls. To determine the parent 
process for any child process the parent process would record (after execution of 
the fork call) the node (statement) number of the first statement to be executed 
by the child process after its creation. 

• Shared data access: In case of access to shared data by a process, the 
information required to be recorded are the variable accessed and whether it is 
being defined or used. 

• Semaphore operation: For each semaphore operation, the information recorded
is the type of operation (P or V), the semaphore ID, and the value of a counter
associated with the semaphore indicating the number of synchronization operations
performed on that semaphore so far.

• Message passing: In case of interprocess communication by message passing,
the recorded information should include the type of the message (e.g. send/
receive) and the pair (sending_process_ID, sending_node_no.) in case of a receive
type. For this, the process executing a msgsend statement would have to append
the above-mentioned pair to the message. Whenever a process executes
a msgrecv statement, it would extract the pair and store it in the trace file.

The concurrency graph of a concurrent program is constructed from the 
execution trace of each process recorded at run-time. An execution trace $TX$ of
a program is a sequence of nodes (statements) that have actually been executed
in that order for some input. Node $Y$ at position $p$ is written as $Y^p$ or $TX(p)$
and is referred to as an event. By $v^q$ we denote variable $v$ at position $q$. A use
of variable $v$ is an event denoting that this variable is referenced. A definition of
variable $v$ is an event denoting an assignment of a value to that variable. $D(Y^p)$
is the set of variables whose values are defined in event $Y^p$. By the term most recent
definition $MR(v^k)$ of variable $v^k$ in $TX$ we mean $Y^p$ such that $v \in D(Y^p)$ and
$v$ is not defined in any of the events after $Y^p$ up to position $k$.
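
The most recent definition can be found by scanning the execution trace backwards from the position of the use. The Python sketch below assumes a trace stored as a list of `(node, defined_variables)` pairs with 1-based positions, and it assumes that the search covers positions strictly before `k`; both choices are illustrative rather than taken from the paper.

```python
def most_recent_definition(trace, variable, k):
    """Return the position p < k of the latest event defining `variable`.

    trace is a list of (node, defined_variables) pairs; position 1 is trace[0].
    Returns None if the variable is not defined before position k.
    """
    for p in range(k - 1, 0, -1):
        _node, defined = trace[p - 1]
        if variable in defined:
            return p
    return None


# Hypothetical trace: statements 1 and 3 define x, statement 2 defines y.
example_trace = [
    ("1", {"x"}),
    ("2", {"y"}),
    ("3", {"x"}),
    ("4", set()),
    ("5", set()),
]
```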

We now discuss how to handle dynamic process creation in the process graph 
and the corresponding control and data dependence subgraphs in the SPDG for 
later use. Let (x, y) and (y, z) be two directed edges. We define transformation by
doing fusion on node y as the replacement of these two edges by the single edge
(x, z). Then, by doing a fusion on synchronization nodes we get the dynamic
process graph i.e., the process graph which contains all the dynamically created 
processes. The SPDG can then be modified to incorporate the control and data 
dependence subgraphs for these dynamically created processes. From this, we get 
a mapping of every event in the execution trace to the nodes of the SPDG. More
than one event may map to a single node in the SPDG if the event corresponds
to a loop predicate or a statement within a loop which gets executed more than
once.
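
The fusion transformation defined above amounts to a small edge-list rewrite. The Python sketch below assumes that the fused node has exactly one incoming and one outgoing edge, as in the definition, and represents edges as `(source, sink)` tuples; the function name is hypothetical.

```python
def fuse(edges, y):
    """Replace the edges (x, y) and (y, z) by the single edge (x, z)."""
    incoming = [e for e in edges if e[1] == y]
    outgoing = [e for e in edges if e[0] == y]
    if len(incoming) != 1 or len(outgoing) != 1:
        raise ValueError("fusion needs exactly one edge into and one out of y")
    x = incoming[0][0]
    z = outgoing[0][1]
    kept = [e for e in edges if y not in e]   # drop both edges touching y
    return kept + [(x, z)]
```

Applying `fuse` to every synchronization node in turn removes those nodes while preserving the process structure.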

In a concurrent or distributed environment, different processes might write 
the significant events to their individual trace files without using any globally 
synchronized timestamp. Therefore, it is necessary to order the recorded events 
properly for use in constructing the DPDG. In particular, ordering of events is 
necessary to construct shared dependence edges among nodes accessing shared 
variables. 



3.3 Event Ordering 

The structure of the concurrency graph and the information stored in the event 
history (i.e. execution trace) help us to order the events. Please note that we will 
be dealing with partial ordering of events only jSj. This type of ordering with 
respect to a logical clock is sufficient to determine the causal relationships
between the events. We can partially order the nodes and edges of the concurrency
graph by using the happened-before relation $\rightarrow$ pj as follows:

• Let $x$ and $y$ be any two nodes of the concurrency graph. $x \rightarrow y$ is true if
$y$ is reachable from $x$ following any sequence of edges in the concurrency graph.

• Let $e_i$ be the edge from node $x_1$ to node $x_2$ and $e_j$ be the edge from node
$y_1$ to node $y_2$; $e_i \rightarrow e_j$ is true if $x_2 \rightarrow y_1$ is true.

If two edges $e_i$ and $e_j$ cannot be ordered using the happened-before
relationship, then these two edges are said to be incomparable. This is expressed as
$e_i \parallel e_j$. Let $SV$ denote a shared variable. Then, $SV_i$ indicates an access of the
shared variable $SV$ in the edge $i$; $SV_{ir}$ and $SV_{iw}$ denote a read and
a write operation, respectively, on the shared variable $SV$ in the edge $i$. A data
race leads to ambiguous and erroneous conditions in programming. A concurrent
program is said to be data race free if every pair of incomparable edges
in the concurrency graph is data race free. The condition given below leads to a
data race in a concurrent program. The condition states that if a shared variable
$SV$ is accessed in two edges which are incomparable and if at least one of these
accesses is a write, then this leads to a data race.

$((SV_{ir} \wedge SV_{jw}) \vee (SV_{iw} \wedge SV_{jw})) \wedge (e_i \parallel e_j)$ for any $i, j$.

It is reasonable to assume that the program to be sliced is data race free. If 
the program under consideration is not data race free, then this can be identified 
during the process of event ordering and can be reported to the user. 
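
The happened-before test and the data race condition can be sketched together in Python. The sketch makes two assumptions that go beyond the text: a node is treated as reachable from itself, so that two consecutive edges are ordered, and the concurrency graph is acyclic. The adjacency-dictionary layout and all names are hypothetical.

```python
def reaches(graph, start, goal):
    """True if `goal` is reachable from `start` (a node reaches itself)."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def happened_before(graph, e1, e2):
    """Edge e1 = (x1, x2) precedes e2 = (y1, y2) if y1 is reachable from x2."""
    return e1 != e2 and reaches(graph, e1[1], e2[0])


def incomparable(graph, e1, e2):
    return (e1 != e2
            and not happened_before(graph, e1, e2)
            and not happened_before(graph, e2, e1))


def data_races(graph, accesses):
    """accesses: list of (edge, variable, mode) with mode "r" or "w"."""
    races = []
    for a in range(len(accesses)):
        for b in range(a + 1, len(accesses)):
            (e1, v1, m1), (e2, v2, m2) = accesses[a], accesses[b]
            if v1 == v2 and "w" in (m1, m2) and incomparable(graph, e1, e2):
                races.append((accesses[a], accesses[b]))
    return races


# Hypothetical concurrency graph: fork at F, child ends at t1, parent waits at W.
example_graph = {"En": ["F"], "F": ["t1", "W"], "t1": ["W"], "W": ["t2"]}
```

In this example a write in the child edge (F, t1) races with a read in the parent edge (F, W), but not with a read in (W, t2), because the join edge orders the child before everything after the wait.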



3.4 Construction of DPDG 

Once the concurrency graph is constructed, the DPDG for the concurrent pro- 
gram can be constructed from the SPDG and the concurrency graph. For every 
event in the execution trace, a corresponding node in the DPDG is created. 
All the data and control dependence edges to the corresponding nodes from this 
node are then constructed in the DPDG. If there exists any potential data
dependence edge, only the edge which has actually been taken during execution is
created by finding out the most recent definition of the variable from the trace 
file. Data dependence edges may cross the process boundaries when a statement 
in a process accesses a shared data item defined by a statement in some other 
process. Shared variable accesses can either be synchronized or unsynchronized. 

• Synchronized access: In case of synchronized access to shared resources
using semaphores, shared dependence edges are constructed by referring to the
concurrency graph. Consider an event $SV_{ir}$ within a (P, V)-block, $(P_i, V_i)$. The
concurrency graph is then traversed backwards from synchronization node $P_i$
along synchronization edges to reach the corresponding source synchronization
node $V_j$. A shared dependency edge from the event $SV_{kw}$ within the (P, V)-block,
$(P_j, V_j)$ to the event $SV_{ir}$ is then constructed.

• Unsynchronized access: When a statement in a process uses a shared vari- 
able, we need to identify the statement which modified this shared variable most 
recently to establish the shared dependence. The task is to trace the event history 
searching for the event (s) modifying this variable. If more than one modification 
to the same variable occurred, then these are ordered by referring to the concur- 
rency graph. A data dependence edge is then constructed from the statement 
which performed the most recent modification to this variable to the statement 
reading it. Consider the unsynchronized event $SV_{ir}$. Let the events $SV_{kw}$, $SV_{mw}$,
and $SV_{tw}$ represent the modifications to the same variable. Let the concurrency
graph reveal that $e_k \rightarrow e_m$, $e_m \rightarrow e_t$, and $e_t \rightarrow e_i$. In this case, we need to
construct a shared dependence edge from $SV_{tw}$ to $SV_{ir}$.
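
The selection of the most recent modification can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: events are identified by hypothetical labels, and `precedes` is an assumed, transitively closed set of ordering pairs read off the concurrency graph.

```python
def latest_write(read_event, write_events, precedes):
    """Pick the write that most recently modified the variable before read_event.

    precedes: set of (a, b) pairs meaning event a is ordered before event b in
              the concurrency graph (assumed transitively closed).
    Returns None when no write is ordered before the read.
    """
    candidates = [w for w in write_events if (w, read_event) in precedes]
    for w in candidates:
        # w is the latest if no other candidate write is ordered after it
        if not any((w, other) in precedes for other in candidates if other != w):
            return w
    return None


# Three writes ordered one after another (k, then m, then t), all before read i.
precedes = {("k", "m"), ("m", "t"), ("t", "i"),
            ("k", "t"), ("k", "i"), ("m", "i")}
source = latest_write("i", ["k", "m", "t"], precedes)
print(source)  # t
```

The shared dependence edge would then be drawn from the returned write event to the read event.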

Communication edges are established from the pairs (sending process ID, 
sending node no.) which are stored in the trace file when msgrecv statements 
get executed. 

4 Related Work 

Cheng proposed a representation for concurrent programs where he generalized 
the notions of CFG and PDG p). Cheng’s algorithm for computing dynamic 
slices is basically a generalization of the initial approach taken by Agrawal and 
Horgan |E|, which computes a dynamic slice using a static graph. Therefore, their 
slicing algorithm may compute inaccurate slices in the presence of loops. Miller 
and Choi use a dynamic dependence graph, similar to ours to perform flow-back 
analysis in their parallel program debugger m- Our method, however, differs 
from theirs in the way the intermediate graphs are constructed. Our graph repre- 
sentation is substantially different from theirs to take care of dynamically created 
processes and message passing using message queues. In their approach, a branch 
dependence graph is constructed statically. In the branch dependence graph data 
dependence edges for individual basic blocks are included. Control dependence 
edges are included during execution and the dynamic dependence graph is built 
by combining in order the data dependence graphs of all basic blocks reached
during execution. We find that it is possible to resolve the control dependences
and a major part of inter-block data dependences statically and we take
advantage of this. The other differences are that we construct the DPDG in a
hierarchical manner and consider the Unix primitives for process creation and 
interprocess communication. Another significant difference arises because of our 
choice of Unix primitives for message passing. In this model messages get stored 
in message queues and are later retrieved from the queue by the receiving pro- 
cess. This is a more elaborate message passing mechanism. Duesterwald et al. 
represent distributed programs using a Distributed Dependence Graph (DDG).
Run-time behavior of a program is analyzed to add data and communication 
dependence features to DDG. They have not considered interprocess communi- 
cation using shared variables. DDG constructs a single vertex for each statement 
and control predicate in the program PH- Since it uses a single vertex for all oc- 
currences of a statement, the slice computed would be inaccurate if the program 
contains loops. 



5 Conclusion 

We have proposed a hierarchical graph representation of concurrent programs 
that lets us efficiently compute dynamic slices. Our method can handle both 
shared memory and message passing constructs. The Unix semantics of message
passing, where messages get stored in a message queue in a partially ordered
manner, introduce complications. We have proposed a solution to this by attaching
the source process ID and the statement number of the msgsnd statement along
with the message. Since we create a node in DPDG for each occurrence of a
statement in the execution trace, the resulting slices are more precise for pro- 
grams containing loops compared to traditional dynamic slicing algorithms such 
as those by Duesterwald et al PD and Cheng’s [Ej which use a single vertex for 
all occurrences of a statement.



Abstract. High-performance computing capability is crucial for the advanced
calculations of scientific applications. A parallelizing compiler can take a
sequential program as input and automatically translate it into a parallel form.
But for loops with arrays of irregular (i.e., indirectly indexed), nonlinear, or
dynamic access patterns, no state-of-the-art compiler can determine their
parallelism at compile time. In this paper, we propose an efficient run-time
scheme to compute a highly parallel execution schedule for those loops. This
new scheme first constructs a predecessor iteration table in the inspector phase,
and then schedules all loop iterations into wavefronts for parallel execution.
For non-uniform access patterns, the performance of inspector/executor
methods usually degrades dramatically, but this is not the case for our scheme.
Furthermore, the scheme is especially suitable for multiprocessor systems
because of its high scalability and low overhead.



1 Introduction 

Automatic parallelization has recently become a key enabling technique for parallel
computing. How to exploit the parallelism in a loop, i.e., loop parallelization, is an
important issue in this area. Current parallelizing compilers are effective for
loops that have no cross-iteration dependences or have only uniform dependences.
However, they are limited in parallelizing loops with complex access patterns or
access patterns that cannot be fully determined statically.

In order to convert sequential programs into their parallel equivalents, parallelizing 
compilers must perform data dependence analysis first to determine whether a loop, 
or part of it, can be executed in parallel without violating the original semantics. This 
analysis is mainly focused on array subscript expressions, i.e., array access patterns. 
Basically, the application programs can be classified into two types: 

1. Regular programs, in which memory accesses are described by linear 
equations of variables (usually loop index variables). 

2. Irregular programs, in which memory accesses are described by indirect
mappings (e.g., index arrays) or depend on values computed at run time.

Regular programs are much easier to deal with because they can be statically 
analyzed at compile-time. Unfortunately, many scientific programs performing 
complex modeling or simulations, such as DYNA-3D and SPICE, are usually 
irregular programs. Irregular accesses take forms such as A(w(i)) or A(idx), where
w is an index array and idx is computed inside the loop but is not an induction
variable. In these circumstances, we can only resort to run-time parallelization
techniques as complementary solutions because the information needed for analysis
is not available until program execution.



2 Related Work 

Two different approaches have been developed for run-time loop parallelization: the 
speculative doall execution and the inspector/executor parallelization method. The 
former approach assumes the loop to be fully parallelizable and executes it 
speculatively, then examines the correctness of parallel execution after loop 
termination. In the latter approach, the inspector examines cross-iteration 
dependences and produces a parallel execution schedule first, then the executor 
performs actual loop operations based on the schedule arranged by the inspector. 



Speculative Doall Parallelization. The speculative doall parallelization speculatively 
executes the loop operations in parallel, accompanied with a marking mechanism to 
track the accesses to the target arrays. After loop termination, an analysis mechanism 
is applied to examine whether this speculative doall parallelization passes or not, i.e., 
to check whether no cross-iteration dependence occurs in the loop. If it passes, a 
significant speedup will be obtained. Otherwise, the altered variables should be 
restored and the loop is re-executed serially. The reader who is interested in this topic 
may refer to Huang and Hsu [2], and Rauchwerger and Padua [9]. 

Run-Time Doacross Parallelization (The Inspector/Executor Method). According 
to the scheduling unit, we classify the inspector/executor methods into two types: the 
reference-level and the iteration-level. The reference-level type assumes a memory 
reference of the loop body as the basic unit of scheduling and synchronization in the 
inspector. Busy-waits are used to ensure that values are produced before they are
used during the executor phase. This type of method has the advantage of increasing
the overlap of dependent iterations, but at the expense of more synchronization
overhead. The reader who is interested in this topic may see Chen et al. [1], and
Xu and Chaudhary [11].

The iteration-level type assumes loop iteration as the basic scheduling unit in the 
inspector. The inspector schedules the source loop iterations into appropriate 
wavefronts at run-time; wavefronts will be executed serially but the iterations in a 
wavefront are executed in parallel. The executor then performs actual execution 
according to the wavefront sequence. The reader who is interested in this topic may 
see Zhu and Yew [12], Midkiff and Padua [6], Polychronopoulos [7], Saltz et al. [10], 
Leung and Zahorjan [4], Leung and Zahorjan [5], Rauchwerger et al. [8], and Huang
et al. [3].
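
To make the iteration-level idea concrete, the Python sketch below runs wavefronts one after another while the iterations inside each wavefront run on a thread pool. It is a sketch under stated assumptions rather than any of the cited methods: `wavefront_levels` is a simple sequential stand-in for an inspector, and the loop body, the 0-based index arrays `w` and `r`, and the array sizes are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def wavefront_levels(w, r):
    """Sequentially assign each iteration a wavefront number (1-based).

    An iteration is placed one wavefront after the latest earlier iteration
    that touched either the element it writes (w[i]) or the one it reads (r[i]).
    """
    last = {}      # element -> wavefront of the latest iteration touching it
    levels = []
    for wi, ri in zip(w, r):
        level = 1 + max(last.get(wi, 0), last.get(ri, 0))
        levels.append(level)
        last[wi] = last[ri] = level
    return levels


def body(i, w, r, x, y):
    """Hypothetical loop body: x(w(i)) = ...; y(i) = x(r(i)) ..."""
    x[w[i]] = 10 * (i + 1)
    y[i] = x[r[i]] + 1


def run_wavefronts(w, r, x, y, workers=4):
    levels = wavefront_levels(w, r)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for wf in range(1, max(levels) + 1):      # wavefronts run serially
            members = [i for i, lv in enumerate(levels) if lv == wf]
            # iterations in one wavefront touch disjoint elements: run in parallel
            list(pool.map(lambda i: body(i, w, r, x, y), members))
    return levels


w = [0, 1, 2, 3, 0, 1, 2, 3]    # hypothetical write indices
r = [4, 5, 6, 7, 1, 0, 3, 2]    # hypothetical read indices

x_seq, y_seq = [0] * 8, [0] * 8
for i in range(8):
    body(i, w, r, x_seq, y_seq)             # sequential reference run

x_par, y_par = [0] * 8, [0] * 8
levels = run_wavefronts(w, r, x_par, y_par)
print(levels, (x_par, y_par) == (x_seq, y_seq))
```

Comparing the parallel result against a sequential run, as done at the end, is a cheap sanity check for any wavefront schedule.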

In general, speculative doall execution gains significant speedup if the target loop
is intrinsically fully parallel; otherwise a hazard arises when cross-iteration
dependences occur [2, 9]. In contrast, inspector/executor methods are profitable in
extracting doacross loop parallelism, but may suffer considerable processing
overhead and synchronization burden [1, 3-8, 10-12].



3 Our Efficient Run-Time Scheme

Reviewing the development of run-time doacross parallelization, we find the
following sources of inefficiency:

• sequential inspector, 

• synchronization overhead on updating shared variables, 

• overhead in constructing dependence chains for all array elements, 

• large memory requirements,

• inefficient scheduler, 

• possible load migration from inspector to executor. 

In this paper, we propose an efficient run-time scheme to overcome these 
problems. Our new scheme constructs an immediate predecessor table first, and then 
schedules the whole loop iterations efficiently into wavefronts for parallel execution. 
Owing to the characteristics of high scalability and low overhead, our scheme is 
especially suitable for multiprocessor systems. 



3.1 Design Considerations 

As described in the previous section, the benefits of run-time parallelization
may be offset by considerable processing overhead. Consequently, reducing the
processing overhead is a major concern of our method, and a parallel inspector
that can exploit the capability of multiprocessing is the objective that we pursue.

Rauchwerger et al. [8] designed a run-time parallelization method, which is fully
parallel, requires no synchronization, and can be applied to any kind of loop. This
scheme is based on the operation of predecessor references; we call it PRS
(Predecessor_Reference_Scheme) for short. Their inspector encodes the
predecessor/successor information, for the references to A(x), in a reference array R_x
and a hierarchy vector H_x so that the scheduler can arrange loop iterations according
to this dependence information. The scheduler is easy to implement but not efficient
enough. In order to identify the iterations belonging to the ith wavefront, all the
references must be examined to determine the ready states of the corresponding
unscheduled iterations in the ith step. Thus, the scheduler takes
O((numref/numproc)*cpl) time for this processing, where numref is the number of
references, numproc is the number of processors, and cpl is the length of the critical
path in the directed acyclic graph that describes the cross-iteration dependences in the
loop. In fact, the above processing overhead can be completely eliminated if we use
the data representation of predecessor iteration instead of predecessor reference.

We therefore develop a new run-time scheme to automatically extract the parallelism
of doacross loops under the following considerations:

1. Devising a highly efficient parallel inspector to construct a predecessor iteration
table for recording the information of iteration dependences.

2. Devising a highly efficient parallel scheduler to quickly produce the wavefront
schedule with the help of the predecessor iteration table.

3. Requiring no synchronization in the scheme, to ensure good scalability.



3.2 An Example

We demonstrate the operation of our run-time scheme by an example. Fig. 1(a) is an
irregular loop to be parallelized. The access pattern of the loop is shown in Fig. 1(b).
In our scheme, a predecessor iteration table is first constructed at run-time, with the
help of an auxiliary array la (which will be explained in the next subsection), as shown
in Fig. 1(c). Then, the wavefronts of loop iterations can be quickly scheduled, as
shown in Fig. 1(d). The predecessor iteration table is implemented by two arrays,
pw(1:numiter) and pr(1:numiter), where numiter is the number of iterations of the
target loop. Element pw(i) records the predecessor iteration of iteration i that
accesses (either reads or writes) the same array element x(w(i)). A similar explanation
applies to pr(i) for the array element x(r(i)). After the predecessor iteration table is
built, the scheduler uses this information to arrange iterations into wavefronts: first,
the iterations with no predecessor iteration (i.e., whose associated elements in pw and
pr are both zero) are scheduled into the first wavefront; then the iterations whose
predecessor iterations have been scheduled are assigned to the next wavefront. The
procedure is repeated until all the iterations are scheduled.



do i = 1, numiter
   x(w(i)) = ...
   y(i) = x(r(i)) ...
end do

w(1:12) = [3 4 1 1 5 2 8 1 8 5 7 2]
r(1:12) = [5 6 1 3 7 2 4 3 8 7 8 1]

(a) An irregular loop (b) Synthetic access pattern

(c) Predecessor iteration table

(d) Wavefront schedule

Fig. 1. An example to demonstrate the operations of our scheme 



3.3 The Inspector 

The goal of our inspector is to construct a predecessor iteration table in parallel. The
iterations of the target loop are distributed in block fashion among the processors; each
processor takes charge of blksize contiguous iterations. We use an auxiliary array
la(1:numproc,1:arysize), which is initially set to zero, to keep track, for each
processor, of the latest iteration that has accessed each array element. Namely, la(p,j)
records the latest iteration accessing array element j on processor p (la is the
abbreviation of latest access). The algorithm of our parallel inspector is shown in Fig.
2, which consists of two phases:

1. Parallel recording phase (lines 3 to 10).

   Step 1. If la(p,w(i)) or la(p,r(i)) is not zero, let them be c and d respectively,
   then record pw(i) as c and pr(i) as d, meaning that c is the
   predecessor iteration of iteration i for accessing the array element
   w(i) and d is the predecessor iteration of iteration i for accessing the
   array element r(i). The arrays pw and pr are initially set to zero.

   Step 2. Set la(p,w(i)) and la(p,r(i)) to the current iteration i. This means that, for
   processor p, iteration i is the latest iteration that accesses the array
   elements w(i) and r(i).

2. Parallel patching phase (lines 11 to 30). For each iteration i, if pw(i)=0 then
   find the largest-numbered processor q, where q < the current processor p,
   such that la(q,w(i)) is not zero. Assume now that la(q,w(i))=j, then set
   pw(i)=j. This means that j is the real predecessor iteration of iteration i in
   accessing the array element w(i). In the same way, if pr(i)=0 we find the
   largest-numbered processor t < the current processor p such that
   la(t,r(i))=k≠0, and set pr(i)=k to mean that, in reality, k is the predecessor
   iteration of iteration i in accessing the array element r(i).

For the example in the previous subsection, we assume that there are two processors.
By block distribution, iterations 1 to 6 are handled by processor 1 and iterations 7 to
12 by processor 2. The contents of the auxiliary array la and the predecessor iteration
table pw and pr for processor 1 and processor 2, while the parallel recording phase is
in progress, are shown in Fig. 3(a). For easier reference, the access pattern in
Section 3.2 is also included in Fig. 3.
We can see that, when processor 1 is dealing with iteration 4, it will write array
element 1 (since w(4)=1) and read array element 3 (since r(4)=3). But array element 1
has been accessed (both read and written in this example) by iteration 3 (this can be
seen from la(1,1)=3), and array element 3 has been accessed (written in this example)
by iteration 1 (since la(1,3)=1). Therefore, the predecessor iterations of iteration 4 are
iteration 3 for writing array element 1, and iteration 1 for reading array element 3. The
predecessor iteration table is hence recorded as pw(4)=3 and pr(4)=1. This behavior is
encoded in lines 5 and 6 of our inspector algorithm in Fig. 2. After this, la(1,1) will be
updated from 3 to 4 to mean that iteration 4 is now the latest iteration accessing
array element 1, and la(1,3) will be updated from 1 to 4 to mean that iteration 4 is
now the latest iteration accessing array element 3. This behavior is coded in
lines 7 and 8 in Fig. 2. Since the values in array la after iteration i is processed might
be changed when iteration i+1 is dealt with, the newly updated
values are bold-faced in Fig. 3(a) for clarity.

In the parallel patching phase, let us see, in Fig. 3(b), why pw(7) remains 0
(unchanged) and how pr(7) is changed to 2 by processor 2. Since pw(7) in Fig. 3(a) is
zero, no prior iteration in processor 2 accesses the same array element as
iteration 7. But from w(7)=8, we know that iteration 7 writes array element 8. Hence,
we have to trace back to processor 1 to check whether it has accessed this element. By
checking la(1,8)=0, we find that no iteration in processor 1 has ever

/* The construction of predecessor iteration table */
 1  pr(1:numiter)=0
 2  pw(1:numiter)=0
/* Parallel recording phase */
 3  doall p=1,numproc
 4    do i=(p-1)*(numiter/numproc)+1, p*(numiter/numproc)
 5      if (la(p,w(i)).ne.0) then pw(i)=la(p,w(i))
 6      if (la(p,r(i)).ne.0) then pr(i)=la(p,r(i))
 7      la(p,w(i))=i
 8      la(p,r(i))=i
 9    enddo
10  enddoall
/* Parallel patching phase */
11  doall p=2,numproc
12    do i=(p-1)*(numiter/numproc)+1, p*(numiter/numproc)
13      if (pw(i).eq.0) then
14        do j=p-1,1,-1
15          if (la(j,w(i)).ne.0) then
16            pw(i)=la(j,w(i))
17            goto S1
18          endif
19        enddo
20  S1: endif
21      if (pr(i).eq.0) then
22        do j=p-1,1,-1
23          if (la(j,r(i)).ne.0) then
24            pr(i)=la(j,r(i))
25            goto S2
26          endif
27        enddo
28  S2: endif
29    enddo
30  enddoall

Fig. 2. The algorithm of the inspector



accessed array element 8. Therefore, pw(7) remains unchanged. By the same
argument, pr(7) in Fig. 3(a) is zero, which means no prior iteration in processor 2
accesses the same array element as iteration 7. But r(7)=4 indicates that iteration 7
reads array element 4. By tracing back to processor 1, we find la(1,4)=2. This means
that, in processor 1, iteration 2 also accesses array element 4. Hence, the predecessor
iteration of iteration 7 should be 2 for read access and pr(7) is changed to 2. This
behavior is coded in lines 13 to 20 and lines 21 to 28 for write access and read access
respectively. The predecessor iterations that have been changed are also bold-faced in
Fig. 3(b) for illustration. As for the time complexity, it is easy to see that our parallel
inspector takes O(numiter) time.



Fig. 3. (a) The contents of array la and the predecessor iteration table for processor 1 and
processor 2 when the parallel recording phase is in progress. (b) The predecessor iteration table
after the parallel patching phase is finished
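
For readers who want to replay the walkthrough, the two inspector phases can be emulated in a few lines of Python. This is a sketch and not the authors' code: the doall loops are run one processor after another (which gives the same result, because processor p writes only its own row of la and its own entries of pw and pr during recording), arrays are padded so that indices start at 1, and numproc is assumed to divide numiter. The index arrays are read off the access pattern of Section 3.2.

```python
def build_predecessor_table(w, r, numproc, arysize):
    """Emulate the two-phase inspector; arrays are 1-based with index 0 unused."""
    numiter = len(w) - 1
    blk = numiter // numproc               # assumes numproc divides numiter
    pw = [0] * (numiter + 1)
    pr = [0] * (numiter + 1)
    la = [[0] * (arysize + 1) for _ in range(numproc + 1)]

    # Recording phase: each processor scans only its own block of iterations.
    for p in range(1, numproc + 1):
        for i in range((p - 1) * blk + 1, p * blk + 1):
            if la[p][w[i]] != 0:
                pw[i] = la[p][w[i]]
            if la[p][r[i]] != 0:
                pr[i] = la[p][r[i]]
            la[p][w[i]] = i
            la[p][r[i]] = i

    # Patching phase: for entries still zero, look back through earlier
    # processors, nearest first, for the latest access to the same element.
    for p in range(2, numproc + 1):
        for i in range((p - 1) * blk + 1, p * blk + 1):
            if pw[i] == 0:
                pw[i] = next((la[q][w[i]] for q in range(p - 1, 0, -1)
                              if la[q][w[i]] != 0), 0)
            if pr[i] == 0:
                pr[i] = next((la[q][r[i]] for q in range(p - 1, 0, -1)
                              if la[q][r[i]] != 0), 0)
    return pw, pr


# Access pattern of the example in Section 3.2 (the leading 0 pads index 0).
w = [0, 3, 4, 1, 1, 5, 2, 8, 1, 8, 5, 7, 2]
r = [0, 5, 6, 1, 3, 7, 2, 4, 3, 8, 7, 8, 1]
pw, pr = build_predecessor_table(w, r, numproc=2, arysize=8)
print(pw[4], pr[4], pw[7], pr[7])  # 3 1 0 2
```

The printed values agree with the walkthrough: pw(4)=3 and pr(4)=1 after recording, while pw(7) stays 0 and pr(7) becomes 2 after patching.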



3.4 The Scheduler 

The algorithm of our parallel scheduler is presented in Fig. 4. Scheduling the loop
iterations into wavefronts becomes very easy once the predecessor iteration table is
available. In the beginning, the wavefront table wf(1:numiter) is set to zero to indicate
that no loop iteration has been scheduled yet.

As in our parallel inspector, the loop iterations are distributed among the processors in
block fashion. For each wavefront number, all processors simultaneously examine the
iterations they are in charge of, in series (lines 8 and 9), to see if they can be assigned to
the current wavefront. Only the iterations that have not been scheduled yet (line 10)
and whose predecessor iterations, for both write and read access, have been scheduled
can be arranged into the current wavefront (lines 11 and 12). This procedure repeats
until all loop iterations have been scheduled (line 5). The maximum wavefront
number is referred to as cpl (critical path length) because it is the length of the critical
path in the directed acyclic graph representing the cross-iteration dependences in the
target loop.

/* Schedule iterations into wavefronts in parallel */
 1  wf(1:numiter)=0
 2  wf(0)=1
 3  done=.false.
 4  wfnum=0
/* Repeated until all iterations are scheduled */
 5  do while (done.eq..false.)
 6    done=.true.
 7    wfnum=wfnum+1
 8    doall p=1,numproc
 9      do i=(p-1)*(numiter/numproc)+1, p*(numiter/numproc)
10        if (wf(i).eq.0) then
11          if (wf(pw(i)).ne.0.and.wf(pr(i)).ne.0) then
12            wf(i)=wfnum
13          else
14            done=.false.
15          endif
16        endif
17      enddo
18    enddoall
19  enddo



Fig. 4. The algorithm of the scheduler 
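
A sequential Python emulation of the scheduling loop is sketched below, using a small hypothetical predecessor table. One detail is an assumption of this sketch rather than something stated in Fig. 4: readiness is tested against a copy of wf taken at the start of each pass, so that an iteration cannot join the same wavefront as a predecessor that was scheduled earlier in that very pass.

```python
def schedule_wavefronts(pw, pr):
    """Return (wf, cpl), where wf[k] is the wavefront of iteration k+1.

    pw, pr: predecessor tables with index 0 unused; 0 means "no predecessor".
    """
    numiter = len(pw) - 1
    wf = [0] * (numiter + 1)
    wf[0] = 1                      # sentinel: "no predecessor" counts as scheduled
    wfnum = 0
    while any(v == 0 for v in wf[1:]):
        wfnum += 1
        before = list(wf)          # state at the start of this pass (sketch assumption)
        for i in range(1, numiter + 1):
            if wf[i] == 0 and before[pw[i]] != 0 and before[pr[i]] != 0:
                wf[i] = wfnum
    return wf[1:], wfnum           # wavefront numbers and critical path length


# Hypothetical predecessor table for six iterations.
pw = [0, 0, 0, 1, 0, 3, 2]
pr = [0, 0, 1, 0, 0, 4, 5]
wf, cpl = schedule_wavefronts(pw, pr)
print(wf, cpl)  # [1, 2, 2, 1, 3, 4] 4
```

The number of passes equals cpl, and each pass scans every iteration once, which is where the dependence of the scheduling time on the critical path length comes from.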



4 Experimental Results 

We performed our experiments on an ALR Quad6, a shared-memory multiprocessor
machine with four 200 MHz Pentium Pro processors and 128 MB of global memory. The
synthetic loop in Fig. 1(a) was augmented with the run-time procedures and OpenMP
parallelization directives, and then compiled using the pgf77 compiler.

We intended to evaluate the impact of the degree of parallelism on the performance of
two run-time schemes: PRS (Predecessor_Reference_Scheme) by Rauchwerger et
al. [8], and our PIS (Predecessor_Iteration_Scheme). The grain size (workload) is
200 µs, the array size is 2048, and the number of iterations varies from 2048 to 65536.
The access patterns are generated using a probabilistic method. Since the accessed
area (array size) is fixed at 2048, a loop with 65536 iterations will be more serial
than a loop with 2048 iterations. Table 1 shows the execution time
measured in each run-time phase.

Fig. 5 is a speedup comparison of the two run-time schemes. We can see that PIS
always obtains a higher speedup and still has a satisfactory speedup of 2.8 in the
worst case. The overhead comparison in Fig. 6 shows that the processing overhead
((inspector time + scheduler time) / sequential loop time) of PRS is dramatically
larger than that of PIS. This is because the scheduler of PRS examines all the
references repeatedly, as mentioned in Section 3.1.

Table 1. Execution time measured on two run-time schemes

Fig. 5. Speedup comparison



Fig. 6. Overhead comparison 



5 Conclusion 

In the parallelization of partially parallel loops, the biggest challenge is to find a
parallel execution schedule that fully extracts the potential parallelism while incurring
as little run-time overhead as possible. In this paper, we present a new run-time
parallelization scheme, PIS, which uses the information of predecessor/successor
iterations instead of predecessor references to eliminate the processing overhead of
repeatedly examining all the references in PRS. The predecessor iteration table can be
constructed in parallel with no synchronization, and our parallel scheduler can quickly
produce the wavefront schedule with its help. Both the theoretical time analysis and
the experimental results show that our run-time scheme achieves better speedup and
lower processing overhead than PRS.



Abstract. Heterogeneous computing environments have become attractive
platforms to schedule computationally intensive jobs. We consider the problem 
of mapping independent tasks onto machines in a heterogeneous environment 
where expected execution time of each task on each machine is known. 
Although this problem has been much studied in the past, we derive new 
insights into the effectiveness of different mapping heuristics by use of two 
metrics - efficacy (E) and utilization (U). Whereas there is no consistent rank 
ordering of the various previously proposed mapping heuristics on the basis of 
total task completion time, we find a very consistent rank ordering of the 
mapping schemes with respect to the new metrics. Minimization of total
completion time requires maximization of the product E×U. Using the insights
provided by the metrics, we develop a new matching heuristic that produces 
high-quality mappings using much less time than the most effective previously 
proposed schemes. 



Keywords: Heterogeneous Computing, Cluster Computing, Scheduling for 
Heterogeneous Systems, Task Assignment, Mapping Heuristics, Performance 
Evaluation 



1 Introduction 

The steady decrease in cost and increase in performance of commodity workstations
and personal computers have made it increasingly attractive to use clusters of such
systems as compute servers instead of high-end parallel supercomputers [3,4,7]. For
example, the Ohio Supercomputer Center has recently deployed a cluster comprising
128 Pentium processors to serve the high-end computing needs of its customers.
Due to the rapid advance in performance of commodity computers, when such
clusters are upgraded by addition of nodes, they become heterogeneous. Thus, many
organizations today have large and heterogeneous collections of networked
computers. The issue of effective scheduling of tasks onto such heterogeneous
clustered systems is therefore of great interest [1,18,22], and several recent research
studies have addressed this problem [2,6,7,12,13,14,15,16,19,20,21].

Although a number of studies have focused on the problem of mapping tasks onto 
heterogeneous systems, there is still an incomplete understanding of the relative 
effectiveness of different mapping heuristics. In Table 1, we show results from a 
simulation study comparing four mapping heuristics that have previously been studied 
by other researchers [2,6,7]. Among these four heuristics, the Min-Min heuristic has
been reported to be superior to the others [6] on the basis of simulation studies, using 
randomly generated matrices to characterize the expected completion times of tasks 
on the machines. However, when we carried out simulations, varying the total number 
of tasks to be mapped, we discovered that the Min-Min scheme was not consistently 
superior. When the average number of tasks-per-processor was small, one of the 
schemes (Max-Min) that performed poorly for a high task-per-processor ratio, was 
generally the best scheme. For intermediate values of the task-per-processor ratio, yet 
another of the previously studied schemes (Fast Greedy) was sometimes the best. The 
following table shows the completion time (makespan, as defined later in the paper) 
for the mapping produced by the four heuristics (they are all explained later in Sec. 2 
of this paper). 

Table 1. The best mapping scheme depends on the average number of tasks per processor 





                             Makespan
Heuristic      16 tasks   32 tasks   64 tasks   256 tasks
Max-Min          381.8      719.8     1519.6     7013.6
Fast Greedy      414.7      680.8     1213.4     4427.1
Min-Min          431.8      700.4     1146.3     3780.6
UDA              496.3      832.1     1424.8     4570.3



Our observations prompted us to probe further into the characteristics of the mappings 
produced by the different schemes. In a heterogeneous environment, the total 
completion time for a set of tasks depends on two fundamental factors: 

1) The utilization of the system, which is maximized by balancing the load, and

2) The extent to which tasks are mapped to machines that are the most effective in 
executing them. 

These two factors are generally difficult to optimize simultaneously. For instance, if
there is a particularly fast processor on which all tasks have the lowest completion time,
it would be impossible for a mapping to both maximize “effectiveness” and achieve
high utilization of the entire system. This is because maximizing execution
effectiveness would require mapping all tasks to the fast processor, leaving all others
idle.

In this study, we use two performance metrics (defined later in Sec. 3), efficacy
(E) and utilization (U), to develop a better understanding of the characteristics of
different mapping schemes. We first characterize several previously proposed static
mapping heuristics using these metrics and show that some of them score highly on
one metric while others score highly on the other metric. But it is the product E×U
that determines the makespan. Using the insights provided by these metrics, we
develop a new heuristic for task mapping that is both effective and computationally
very efficient. Over a wide range of machine/task parameters, it provides consistently
superior mappings (with improvement in makespan of up to 20%), and takes
considerably less time to generate the mappings than the Min-Min heuristic.

The rest of the paper is organized as follows. Section 2 provides necessary
definitions. The performance metrics used are defined in Section 3. Enhancements to
an existing matching heuristic are proposed in Section 4. Experimental results are
discussed in Section 5. Section 6 provides conclusions.



2 Background and Notations 

In this section, we provide some background on previously proposed task matching 
heuristics and explain the notation used. We use the same notation as in [6]. A set of t 
independent tasks is to be mapped onto a system of m machines, with the objective of 
minimizing the total completion time of the tasks. It is assumed that the execution 
time for each task on each machine is known prior to execution [3,6,8,9,17] and 
contained within an ETC (Expected Time to Compute) matrix. Each row of the ETC 
matrix contains the estimated execution times for a given task on each machine. 
Similarly, each column of the ETC matrix consists of the estimated execution times 
for each task on a given machine. Thus, ETC[i,j] is the estimated execution time for 
task i on machine j. The machine availability time for machine j, MAT[j], is the 
earliest time a machine j can complete the execution of all tasks that have previously 
been assigned to it. The completion time, CT[i,j], is the sum of machine j’s 
availability time prior to assignment of task i and the execution time of task i on 
machine j, i.e. CT[i,j] = MAT[j] + ETC[i,j]. The performance criterion usually used 
for comparison of the heuristics is the maximum value of CT[i,j], for 0 ≤ i < t and 
0 ≤ j < m, also called the makespan [6]. The goal of the matching heuristics is to 
minimize the makespan, i.e. the completion time of the entire set of tasks. 
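
The notation above can be illustrated with a minimal Python sketch. The function names and the small ETC matrix below are illustrative assumptions, not taken from the paper; the code only accumulates MAT[j] for a given mapping and reports the largest value as the makespan.

```python
from typing import List


def machine_availability(etc: List[List[float]], mapping: List[int]) -> List[float]:
    """Return MAT[j] for every machine j, where mapping[i] is the machine of task i."""
    num_machines = len(etc[0])
    mat = [0.0] * num_machines
    for task, machine in enumerate(mapping):
        # CT[task, machine] = MAT[machine] + ETC[task, machine]
        mat[machine] += etc[task][machine]
    return mat


def makespan_of(etc: List[List[float]], mapping: List[int]) -> float:
    """The makespan is the largest completion time over all machines."""
    return max(machine_availability(etc, mapping))


# Hypothetical ETC matrix: rows are tasks, columns are machines.
ETC = [[4.0, 6.0],
       [3.0, 2.0],
       [5.0, 9.0]]
MAPPING = [0, 1, 0]  # tasks 0 and 2 on machine 0, task 1 on machine 1
```

With this mapping, machine 0 accumulates the execution times of tasks 0 and 2, and machine 1 only that of task 1.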

A number of heuristics have been proposed in previous studies. However, in this 
paper we only consider a subset of the heuristics reported, that run within a few 
seconds on a Pentium PC for mapping several hundred tasks. Below, we briefly 
describe the heuristics considered. 

UDA: User-Direct Assignment assigns each task to a machine with the lowest 
execution time for that task [3,6]. Its computational complexity is O(N), where N is 
the number of tasks being mapped. 

Fast Greedy: The Fast Greedy heuristic assigns tasks in arrival order, with each task 
being assigned to the machine which would result in the minimum completion time 
for that task [3,6], based on the partial assignment of previous tasks. Its computational 
complexity is also O(N). 
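
As a rough sketch of the two single-pass heuristics just described (the function names and the example matrix are assumptions made for illustration), UDA looks only at each row of the ETC matrix, while Fast Greedy also tracks the machine availability times built up by earlier assignments. Each task inspects all m machines, so the cost is linear in the number of tasks for a fixed machine count.

```python
def uda(etc):
    """Map each task to a machine with the lowest execution time for it."""
    return [min(range(len(row)), key=row.__getitem__) for row in etc]


def fast_greedy(etc):
    """Map tasks in arrival order to the machine giving the earliest completion."""
    mat = [0.0] * len(etc[0])  # machine availability times
    mapping = []
    for row in etc:
        # Completion time on machine j is MAT[j] + ETC[i, j].
        best = min(range(len(row)), key=lambda j: mat[j] + row[j])
        mat[best] += row[best]
        mapping.append(best)
    return mapping


# Hypothetical ETC matrix in which machine 0 is fastest for every task.
ETC = [[4.0, 6.0],
       [3.0, 5.0],
       [5.0, 9.0]]
```

On this matrix UDA places every task on machine 0, whereas Fast Greedy diverts the second task to machine 1 because machine 0 is already busy.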

Min-Min: The Min-Min heuristic begins with the set U of all unmapped tasks. The 
set of minimum completion times, $M = \{ m_i : m_i = \min_{0 \le j < m} \mathrm{CT}[i,j], \text{ for each } i \in U \}$, 
is found. The task i with the overall minimum completion time from M is selected and 
assigned to the corresponding machine which would result in the minimum 
completion time, based on the partial assignment of tasks so far. The newly mapped 
task is removed from U and the process repeated until all tasks are mapped (i.e. 
$U = \emptyset$) [6]. The computational complexity of Min-Min is $O(N^2)$. 

Max-Min: The Max-Min heuristic is similar to Min-Min and at each step finds the set 
of minimum completion times, $M = \{ m_i : m_i = \min_{0 \le j < m} \mathrm{CT}[i,j], \text{ for each } i \in U \}$, 
but the task i with the highest completion time from M is selected and assigned to the 
corresponding machine. The newly mapped task is removed from U and the process 
repeated until all tasks are mapped (i.e. $U = \emptyset$) [6]. The computational 
complexity of Max-Min is $O(N^2)$. 
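
The two heuristics differ only in which entry of M is selected, which the following Python sketch makes explicit. The function name, the `pick_max` flag, the tie-breaking by lowest task index and the example matrix are all assumptions made for illustration.

```python
def min_min_family(etc, pick_max=False):
    """Min-Min when pick_max is False, Max-Min when it is True."""
    num_machines = len(etc[0])
    mat = [0.0] * num_machines          # machine availability times
    unmapped = set(range(len(etc)))     # the set U of unmapped tasks
    mapping = [None] * len(etc)
    while unmapped:
        # For each unmapped task: (minimum completion time, machine achieving it).
        best = {}
        for i in unmapped:
            j = min(range(num_machines), key=lambda k: mat[k] + etc[i][k])
            best[i] = (mat[j] + etc[i][j], j)
        chooser = max if pick_max else min
        # sorted() makes tie-breaking deterministic (lowest task index wins).
        task = chooser(sorted(best), key=lambda i: best[i][0])
        completion, machine = best[task]
        mapping[task] = machine
        mat[machine] = completion
        unmapped.remove(task)
    return mapping


# Hypothetical ETC matrix with one long task (task 2).
ETC = [[2.0, 3.0],
       [3.0, 4.0],
       [10.0, 12.0]]
```

On this example Min-Min places the long task last and ends with a makespan of 12, while Max-Min places it first and reaches a makespan of 10, in line with the load-balancing behaviour discussed later in the paper.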



3 Performance Metrics 

In this section, we define two metrics that we use to quantitatively capture the 
fundamental factors that affect the quality of mapping for a heterogeneous 
environment. Using these metrics, we then characterize the four static mapping 
heuristics described in Sec. 2. 

Efficacy Metric: In a heterogeneous system, the execution time of a task depends on 
the machine to which it gets mapped. For any given task, there exist one or more 
machines that are the most effective for that task, i.e. the execution time for that task 
is minimum on that(those) machine(s). The metric proposed below is a collective 
measure of efficacy for a mapping. We call this metric efficacy in order to distinguish 
it from the term efficiency, which generally denotes the fraction of peak machine 
performance achieved. 

We denote by the Total Execution Time (TET) the sum of execution times of all 
the tasks on their assigned machines, i.e., $\mathrm{TET} = \sum_i \mathrm{ETC}[i, \mathrm{Map}[i]]$, 
where Map[i] denotes the machine onto which task i is mapped. It represents the total 
number of computing cycles spent on the collection of tasks by all the processors 
combined. This is also equal to the sum of the machine availability times, 
$\sum_j \mathrm{MAT}[j]$, over all machines, for the mapping used. 

We define the Best Execution Time (BET) as the sum of machine availability 
times for all machines when all tasks are mapped to their optimal machines: 
$\mathrm{BET} = \sum_i \mathrm{ETC}[i, \mathrm{Best}[i]]$, where Best[i] represents a machine 
on which task i has the lowest execution time. 

The system efficacy metric is defined as: 

\[ E = \frac{\mathrm{BET}}{\mathrm{TET}} \tag{1} \]

Note that E always lies between 0 and 1. This is because for any task i, 
ETC[i, Best[i]] ≤ ETC[i, Map[i]]. A value of 1 for E implies that all tasks have been assigned to 
maximally effective machines. A low value for E suggests that a large fraction of the 
work has been mapped to machines that are considerably slower for those tasks than 
the best suited machines for them. 

Utilization Metric: The utilization metric is a measure of overall load-balance 
achieved by the mapping. It is defined as: 

\[ U = \frac{\mathrm{TET}}{\mathrm{makespan} \times m} \tag{2} \]

This metric reflects the useful computing cycles as a fraction of the total 
available cycles. Note that U always lies between 0 and 1. This is because 
MAT[j] ≤ makespan for every machine j. Since $\mathrm{TET} = \sum_j \mathrm{MAT}[j]$ and 
there are m machines, the result follows. A value of 1 for U implies perfect load 
balancing among all the machines. A low value for U implies poor load balancing. 

Although high values of efficacy and utilization are inherently desirable, it is 
generally impossible to achieve mappings that simultaneously maximize both. 
Consider, for example, a situation where a particular processor happens to be the 
fastest for every task. The only mapping that would maximize E is the one that maps 
all tasks to that fast processor. This would clearly result in very low utilization (1/m 
for a system with m machines). In terms of the metrics E and U, minimization of 
makespan requires the maximization of their product, since: 

\[ \mathrm{makespan} = \frac{\mathrm{BET}}{m} \times \frac{1}{E \times U} \tag{3} \]
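
The relation between the two metrics and the makespan can be checked numerically. The sketch below is illustrative only: the function name and the small ETC matrix are assumptions, and the code simply evaluates definitions (1) and (2) and then confirms identity (3) for one mapping.

```python
def metrics(etc, mapping):
    """Return (efficacy, utilization, makespan, BET) for a mapping."""
    num_machines = len(etc[0])
    mat = [0.0] * num_machines
    for i, j in enumerate(mapping):
        mat[j] += etc[i][j]
    makespan = max(mat)
    tet = sum(mat)                        # total execution time
    bet = sum(min(row) for row in etc)    # best execution time
    efficacy = bet / tet
    utilization = tet / (makespan * num_machines)
    return efficacy, utilization, makespan, bet


# Hypothetical ETC matrix: machine 0 is the fastest for every task.
ETC = [[2.0, 3.0],
       [3.0, 4.0],
       [10.0, 12.0]]

e, u, span, bet = metrics(ETC, [1, 1, 0])
# Identity (3): makespan = (BET / m) / (E * U), here with m = 2.
identity_holds = abs(span - (bet / 2) / (e * u)) < 1e-9

# Mapping everything to the fastest machine maximizes E but gives U = 1/m.
e_all, u_all, span_all, _ = metrics(ETC, [0, 0, 0])
```

For the mapping [1, 1, 0] the product E×U is 0.75 and the makespan is 10; for the all-on-machine-0 mapping E is 1 but U drops to 0.5, so the makespan rises to 15.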

In the following subsection, we characterize the previously mentioned mapping 
heuristics in terms of these two metrics. 



Characterization of Previously Proposed Mapping Heuristics 

A number of simulations were performed using the four mapping heuristics (UDA, 
Fast Greedy, Min-Min and Max-Min). Various combinations of heterogeneity 
parameters were experimented with. The observed trends did not change significantly 
with the values for the heterogeneity parameters for task and machine heterogeneity. 
Hence we present results for just one combination of parameters - low task 
heterogeneity, low machine heterogeneity and an inconsistent ETC matrix. Table 2 
shows the efficacy and utilization values for the same mappings whose makespans 
were summarized earlier in Table 1. 

Whereas there is no consistently superior scheme as far as the makespan is 
concerned, the picture is very consistent when viewed in terms of the metrics of 
efficacy and utilization. For all four cases, over a range of variation of parameters, 
there is a single consistent rank ordering of the four schemes in relation to each of the 
metrics. The ordering by increasing efficacy is: Max-Min, Fast Greedy, Min-Min, 
UDA. The ordering by increasing utilization is exactly the reverse: UDA, Min-Min, 
Fast Greedy, Max-Min. 

The analysis of the efficacy and utilization of the four methods helps explain the 
fact that no single method is consistently superior in relation to the overall makespan. 
When the average number of tasks per processor is low, the utilization is lower than 
when the average number of tasks per processor is high, but the difference in 
utilization of the different schemes is higher. However, the difference in efficacy 
across schemes is lower when the average number of tasks per processor is low. So 
for a low task-per-processor ratio, the difference in the E×U product for different 
schemes is dominated by the differences in utilization. Max-Min achieves better 
utilization than Min-Min. It tends to assign the longest tasks first and therefore is 
better able to balance load at the end of the schedule by assigning the shorter jobs. In 
contrast, Min-Min tends to assign the shorter jobs first, and is generally not as 
effective in creating a load-balanced schedule because the longest jobs are assigned 
last and can cause greater load imbalance. Max-Min achieves an overall higher 
product of E×U and therefore lower makespan when the average number of tasks per 
processor is low. The reason is that Max-Min's utilization superiority over the other 
schemes more than makes up for its lower efficacy. As the average number of tasks 
per processor increases, the utilization of all schemes improves, tending 
asymptotically to 1.0. Thus the difference between the utilization of the various 
schemes decreases and the efficacy dominates the E×U product. 



Table 2. Efficacy, utilization and makespan values for mapping heuristics 

Metric     Heuristic   16 tasks   32 tasks   64 tasks   256 tasks
E          Max-Min         0.69       0.62       0.56        0.48
           Greedy          0.85       0.83       0.82        0.80
           Min-Min         0.93       0.93       0.95        0.98
           UDA             1.00       1.00       1.00        1.00
U          Max-Min         0.79       0.94       0.98        1.00
           Greedy          0.60       0.75       0.85        0.96
           Min-Min         0.52       0.65       0.77        0.92
           UDA             0.43       0.52       0.60        0.75
E×U        Max-Min         0.54       0.59       0.55        0.48
           Greedy          0.51       0.62       0.69        0.76
           Min-Min         0.49       0.60       0.73        0.89
           UDA             0.43       0.52       0.60        0.75
Makespan   Max-Min        381.8      719.8     1519.6      7013.6
           Greedy         414.7      680.8     1213.4      4427.1
           Min-Min        431.8      700.4     1146.3      3780.6
           UDA            496.3      832.1     1424.8      4570.3



The efficacy of UDA is always 1. The efficacy of Min-Min stays above 0.95 and 
actually is slightly higher with 256 tasks than with 16 tasks. The efficacy of Fast 
Greedy deteriorates slightly as the average number of tasks per processor increases, 
while the efficacy of Max-Min deteriorates considerably (from over 0.7 to under 0.5). 
Hence as the average number of tasks per processor increases, Min-Min and UDA 
perform better, relative to Max-Min. The results in Table 2 suggest that UDA and 
Min-Min are schemes that generally achieve consistently high efficacy, but are 
somewhat deficient with respect to utilization. In contrast, Max-Min scores 
consistently high on utilization, but suffers from poor efficacy. This observation 
serves as the basis for an approach to develop an improved mapping heuristic, 
detailed in the next section. 



4 Enhancing Mapping Heuristics 

In this section, we show how the insights gained by use of the efficacy and utilization 
metrics can be used in enhancing existing mapping heuristics and/or developing new 
heuristics. First we provide the basic idea behind the approach and then demonstrate it 
by applying it to the UDA heuristic. 

When mapping independent tasks in a heterogeneous environment, in order to 
achieve a low makespan, the product E×U must be maximized. From the results of the 
previous section, we observe that some existing mapping schemes achieve high 
efficacy, while some others achieve higher utilization. None of the schemes is 
consistently superior with respect to both efficacy and utilization. The basic idea 
behind our approach to enhancing mapping heuristics is to start with a base mapping 
scheme and attempt to improve the mapping with respect to the metric that the 
scheme fares poorly in. Thus, if we were to start with Min-Min or UDA, we find that 
efficacy is generally very good, but utilization is not as high as desired. On the other 
hand, Max-Min is consistently superior with respect to utilization, but often has very 
poor efficacy. If we start with an initial mapping that has high efficacy, we could 
attempt to incrementally improve utilization without suffering much of a loss in 
efficacy. In contrast, if we start with a high-utilization mapping, the goal would be to 
enhance efficacy through incremental changes to the mapping, without sacrificing 
much utilization. 

The UDA heuristic stands at one end of the efficacy-utilization trade-off spectrum 
- it always has a maximal efficacy of 1.0, but usually has poor utilization. The idea 
behind the proposed enhancement is to start with the initial high-efficacy mapping 
produced by UDA, and progressively refine it by moving tasks, so that utilization is 
improved without sacrificing too much efficacy. This is done in two phases - a 
Coarse-Tuning phase, followed by a Fine-Tuning phase. 



4.1 Coarse Tuning 



Starting from an initial mapping that has very high efficacy, the objective of the 
Coarse Tuning phase (CT) is to move tasks from overloaded machines to underloaded 
machines in a way that attempts to minimize the penalty due to the moves. This 
penalty factor for moving task i from machine j to machine k is defined as: 



\[ \mathrm{move\_penalty}(i, j, k) = \frac{\mathrm{ETC}[i,k]}{\mathrm{ETC}[i,j]} \tag{4} \]



The algorithm uses an Upper Bound Estimate (UBE) for the optimal makespan. This 
is determined by use of any fast mapping heuristic. Here we use the Fast-Greedy 
algorithm to provide the estimate. Fast-Greedy has a running time of O(N) for N tasks 
(whereas Min-Min and Max-Min are $O(N^2)$ algorithms). The CT phase attempts to 
move tasks from all overloaded processors (with total load greater than UBE) to 
underloaded processors (with total load under UBE) so that at the end of the phase, all 
processors have a total load of approximately UBE. 

The moves are done by considering the penalties for tasks on the overloaded 
processors. The CT phase first considers the tasks mapped to the maximally loaded 




44 P. Holenarsipur et al. 



processor. For each task, the minimal move-penalty is computed, for a move from its 
currently mapped machine to the next best machine for the task. The tasks are sorted 
in decreasing order of the minimal move-penalty. The CT phase attempts to retain 
those tasks with the highest move-penalty and move tasks with lower move-penalties 
to other processors. Scanning the tasks in sorted move-penalty order, tasks are marked 
off as immovable until the total load from the immovable tasks is close to UBE. The 
remaining tasks are candidates for moving to underloaded processors. 
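
The selection of immovable tasks can be sketched as follows. This is only one possible reading of the description above, not the authors' implementation (which is given in [11]): the function name is hypothetical, and "close to UBE" is interpreted here as "scan in sorted order and stop at the first task that would push the retained load above UBE".

```python
def split_movable(etc, tasks_on_machine, machine, ube):
    """Return (immovable, movable) task lists for one overloaded machine."""

    def min_move_penalty(i):
        # Eq. (4) evaluated for the cheapest alternative machine k != machine.
        cheapest = min(etc[i][k] for k in range(len(etc[i])) if k != machine)
        return cheapest / etc[i][machine]

    # Tasks that are most expensive to move are considered for retention first.
    ordered = sorted(tasks_on_machine, key=min_move_penalty, reverse=True)
    immovable, load = [], 0.0
    for pos, i in enumerate(ordered):
        # Assumption: stop as soon as retaining task i would exceed UBE.
        if load + etc[i][machine] > ube:
            return immovable, ordered[pos:]
        immovable.append(i)
        load += etc[i][machine]
    return immovable, []


# Hypothetical ETC matrix; all four tasks are currently on machine 0.
ETC = [[2.0, 4.0, 8.0],
       [3.0, 3.6, 9.0],
       [4.0, 12.0, 6.0],
       [1.0, 1.3, 5.0]]
KEPT, CANDIDATES = split_movable(ETC, [0, 1, 2, 3], machine=0, ube=6.5)
```

Here tasks 0 and 2 have the highest minimal move-penalties (2.0 and 1.5) and together load machine 0 to 6.0, so they are retained; tasks 3 and 1 become candidates for moving.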

For each movable task on an overloaded processor, a move-diff penalty (MDP) is 
then computed. Initially this is the difference between the task’s execution times on 
the third-best machine and second-best machine for the task, normalized by the 
execution time on the best machine (i.e. the currently mapped machine, since UDA 
maps all tasks to their best machines): 

\[ \mathrm{MDP}(i) = \frac{\mathrm{ETC}[i, m_3] - \mathrm{ETC}[i, m_2]}{\mathrm{ETC}[i, m_1]} \tag{5} \]

where $m_1$, $m_2$, $m_3$ are respectively the best, second-best and third-best 
machines for task i. 

As the algorithm progresses, and some of the machines get filled, the move-diff 
penalty is recalculated using the two next best (unfilled) machines. Thus, in general: 

\[ \mathrm{MDP}(i) = \frac{\mathrm{ETC}[i, m_3] - \mathrm{ETC}[i, m_2]}{\mathrm{ETC}[i, m_1]} \tag{6} \]

where $m_1$, $m_2$, $m_3$ are respectively the best, $k$-th best and $(k+1)$-th best 
machines for task i. In the extreme case when k = m, we set ETC[i, m+1] = 0. 
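
A small Python sketch of the move-diff penalty may help. The function name and the representation of "filled" machines as a set of machine indices are assumptions made for illustration; the convention of using 0 when no further machine exists follows the text above.

```python
def move_diff_penalty(etc_row, filled=frozenset()):
    """MDP of one task, given its row of ETC values and the filled machines."""
    order = sorted(range(len(etc_row)), key=etc_row.__getitem__)
    best = order[0]  # m1: the task's best machine, where UDA placed it
    open_machines = [j for j in order[1:] if j not in filled]
    if not open_machines:
        raise ValueError("no unfilled machine left to move to")
    second = etc_row[open_machines[0]]
    # Convention from the text: a non-existent next machine contributes 0.
    third = etc_row[open_machines[1]] if len(open_machines) > 1 else 0.0
    return (third - second) / etc_row[best]


# Hypothetical ETC row for one task on four machines.
ROW = [2.0, 3.0, 9.0, 5.0]
```

With no machine filled, the two next best machines cost 3 and 5, so the penalty is (5 - 3) / 2 = 1.0. Once machine 1 is filled, the candidates become machines 3 and 2, and the penalty rises to (9 - 5) / 2 = 2.0.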

The movable tasks are moved in decreasing order of their move-diff penalty. The 
rationale behind this criterion is as follows. All movable tasks are to be moved to 
some machine or the other. The lowest penalty for each task will result if it is moved 
over to its second-best machine. However, it is very likely that after several tasks are 
so moved, the more powerful of the underloaded machines will reach a load close to 
UBE, preventing additional moves onto them. Later moves will likely be forced onto 
less powerful underloaded machines, thereby suffering a larger move penalty. So it 
would be preferable to first move those tasks which would have the greatest penalty if 
they were forced to their third-best choice instead of their second-best choice, and so 
on. This is captured by the move-diff penalty. Due to space limitations, we omit the 
pseudo-code for the heuristic, and refer the reader to [11] for details. 



4.2 Fine Tuning 

After the Coarse Tuning phase, all processors are expected to have a load of 
approximately UBE. A fine-tuning load-balancing phase is then employed. It first 
attempts to move tasks from the most loaded machine to some under-loaded machine, 
so that the total makespan decreases maximally. Tasks in the most loaded machine are 
ordered by the penalty for moving the task to the worst machine. Task moves are 
attempted in order of increasing penalty, i.e. those tasks suffering the lowest penalty 
from moving are moved first. Note that the order of task selection for moves is 
exactly the opposite to that used in the Coarse Tuning phase - during Fine Tuning, the 
lowest penalty tasks are moved first, while the highest penalty tasks are first moved 
during Coarse Tuning. This difference is a consequence of the fact that with Coarse 
Tuning, due to the knowledge of UBE, all tasks to be moved are known a priori. In 
contrast, during Fine Tuning, we do not know how many tasks can be moved with 
decrease in makespan. When no more task moves from the most-loaded machine are 
possible that reduce makespan, pair-wise swaps are attempted between the most- 
loaded machine and any other machine. The reader is referred to [11] for details. 
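
The move step of this phase can be approximated by a simple local search. The sketch below is a deliberately simplified stand-in, not the procedure of [11]: it ignores the penalty-based ordering and the pair-wise swaps, and at each step applies the single task move off the most loaded machine that yields the lowest resulting makespan. The function name and the example data are assumptions.

```python
def fine_tune_moves(etc, mapping):
    """Move-only refinement; returns (improved mapping, its makespan)."""
    mapping = list(mapping)
    num_machines = len(etc[0])
    mat = [0.0] * num_machines
    for i, j in enumerate(mapping):
        mat[j] += etc[i][j]
    while True:
        src = max(range(num_machines), key=mat.__getitem__)  # most loaded machine
        best = None  # (new makespan, task, target machine, new MAT)
        for i, j in enumerate(mapping):
            if j != src:
                continue
            for k in range(num_machines):
                if k == src:
                    continue
                trial = list(mat)
                trial[src] -= etc[i][src]
                trial[k] += etc[i][k]
                span = max(trial)
                # Accept only moves that strictly reduce the makespan.
                if span < mat[src] and (best is None or span < best[0]):
                    best = (span, i, k, trial)
        if best is None:
            return mapping, max(mat)
        _, task, target, mat = best
        mapping[task] = target


# Hypothetical ETC matrix; the starting point is the UDA mapping [0, 0, 0].
ETC = [[2.0, 3.0],
       [3.0, 4.0],
       [10.0, 12.0]]
TUNED, TUNED_SPAN = fine_tune_moves(ETC, [0, 0, 0])
```

Starting from a makespan of 15, the sketch moves task 1 and then task 0 to machine 1 and stops at a makespan of 10, because no remaining single move lowers it further. The loop terminates since every accepted move strictly decreases the makespan.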



5 Performance Evaluation 

In this section, we compare the four existing heuristics discussed earlier - Min-Min, 
Max-Min, Fast-Greedy and UDA, with the enhanced UDA-based schemes. The 
UDA-CFT version incorporates two stages of enhancements: coarse tuning and fine 
tuning. We also studied the effectiveness of both these enhancements independently. 

Simulation experiments were performed for different heterogeneity parameters for 
tasks and machines, as well as for varying number of tasks per processor. It was 
observed that the relative effectiveness of the heuristics did not change much with 
respect to machine and task heterogeneity. Therefore we only present results for a 
single value of task heterogeneity (100) and machine heterogeneity (10). Fig. 1 
shows results for the seven schemes: Min-Min, Max-Min, Fast-Greedy, UDA, UDA-CFT, 
UDA-Coarse, and UDA-Fine. The data displayed includes the efficacy, 
utilization, and makespan metrics for the heuristics. The metrics are shown for 
different numbers of tasks being mapped to 8 machines. 

Let us first consider the data for the 16 tasks case. The graph displaying E vs. U 
shows that UDA has an efficacy of 1.0, but a utilization of only 0.42. The application 
of Coarse Tuning and Fine Tuning (UDA-CFT) to UDA results in considerable 
improvement to the utilization (to almost 0.7) with a slight loss of efficacy (to a little 
under 0.9). The E×U product rises to around 0.6 from 0.42, with a corresponding 
decrease in makespan. The performance difference between the UDA-Fine and UDA-CFT 
schemes is very small. This is because UDA produces a mapping with makespan 
close to the makespan of Fast-Greedy, i.e. there is very little scope for tasks to be 
moved from over-loaded to under-loaded machines. Like UDA, Min-Min also 
displays high efficacy (around 0.96) but low utilization (around 0.52). With Max-Min, 
we have higher inherent utilization (0.8) but lower efficacy (0.72). 

The trends are very similar for the other three cases shown in Fig. 1. In all cases, 
the enhanced versions improve utilization at the price of a small decrease in efficacy. 
As the average number of tasks per processor increases, the utilization of the base 
schemes increases. As a consequence, the extent of improvement possible through the 
enhanced schemes diminishes. The reader is referred to [11] for results using different 
values of task/data heterogeneity and average number of tasks per processor, for 
consistent and inconsistent systems. The results for other cases are qualitatively 
similar to those seen in Fig. 1. 

Table 3 presents data on the execution time for the various heuristics, i.e. the time 
needed to execute the heuristics to generate the mappings. The execution times were 
obtained by executing the heuristics on a Sun Ultra 4 Sparc machine. The number of 
tasks was varied from 256 to 2048, with the number of processors of the simulated 
system being 32. The execution time for the UDA-CFT heuristic can be seen to be 
about two orders of magnitude lower than the Min-Min or Max-Min heuristics. 



Table 3. Execution times (seconds) for mapping heuristics 

              Number of tasks (32 machines)
Heuristic       256      512     1024     2048
Greedy        <0.01    <0.01     0.01     0.01
UDA           <0.01    <0.01    <0.01     0.01
UDA-Coarse    <0.01     0.01     0.01     0.03
UDA-CFT       <0.01     0.01     0.02     0.05
UDA-Fine      <0.01    <0.01     0.01     0.02
MinMin         0.19     1.00     3.16    12.46
MaxMin         0.20     1.05     3.33    13.07



6 Conclusions 

The aim of this work was to obtain insights into different static mapping heuristics for 
heterogeneous environments by use of two metrics - efficacy and utilization. These 
metrics provide a clear and consistent characterization of several previously proposed 
mapping heuristics. The insights obtained from the characterization can be useful in 
improving existing mapping schemes and in developing new strategies. This was 
demonstrated by developing an enhancement to the UDA mapping scheme. Starting 
with a mapping that had maximum efficacy, a two-phase tuning heuristic was 
proposed to improve the utilization of the mapping. The enhanced scheme produces 
mappings that are comparable to or better than Min-Min (reported to be the best 
heuristic in previous studies), with the total execution time for the enhanced UDA 
scheme being over an order of magnitude lower than Min-Min. We believe that these 
metrics will also be useful in understanding and characterizing task mapping 
strategies in the more complex context of online dynamic scheduling. 



References 

1. A. H. Alhusaini, V. K. Prasanna, and C. S. Raghavendra. A Unified Resource Scheduling 
Framework for Heterogeneous Computing Environments. 8th Heterogeneous Computing 
Workshop (HCW'99), Apr. 1999 

2. R. Armstrong, D. Hensgen, and T. Kidd. The Relative Performance of Various Mapping 
Algorithms is Independent of Sizable Variances in Run-time Predictions. 7th IEEE 
Heterogeneous Computing Workshop (HCW'98), Mar. 1998, pp. 79-87 

3. B. Armstrong, R. Eigenmann. Performance Forecasting: Towards a Methodology for 
Characterizing Large Computational Applications. Proceedings of the 1998 International 
Conference on Parallel Processing, Aug. 1998, pp. 518-527 

4. F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao. Application-Level Scheduling on 
Distributed Heterogeneous Networks. Proceedings of Supercomputing 1996 





Static Mapping Heuristics for Heterogeneous Systems 47 



5. F. Berman and R. Wolski. Scheduling from the Perspective of the Application. 
Proceedings of Symposium on High Performance Distributed Computing, 1996 

6. T. D. Braun, H. J. Siegel, N. Beck, L. L. Boloni, M. Maheswaran, A. I. Reuther, J. R. 
Robertson, M. D. Theys, Bin Yao, D. Hensgen, and R. F. Freund. A Comparison Study of 
Static Mapping Heuristics for a Class of Meta-tasks on Heterogeneous Computing Systems. 
8th IEEE Heterogeneous Computing Workshop (HCW'99), Apr. 1999, pp. 15-29 

7. T. D. Braun, H. J. Siegel, N. Beck, L. L. Boloni, M. Maheswaran, A. I. Reuther, J. P. Robertson, 
M. D. Theys, and B. Yao. A Taxonomy for Describing Matching and Scheduling Heuristics 
for Mixed-machine Heterogeneous Computing Systems. IEEE Workshop on Advances in 
Parallel and Distributed Systems, Oct. 1998, pp. 330-335 

8. F. Chang, V. Karamcheti, Z. Kedem. Exploiting Application Tunability for Efficient, 
Predictable Parallel Resource Management. Proceedings of the 13th International Parallel 
Processing Symposium / 10th Symposium on Parallel and Distributed 
Processing, Apr. 1999, pp. 749-758 

9. D. Feitelson, A. Weil. Utilization and Predictability in Scheduling the IBM SP/2 with 
Backfilling. Proceedings of the 12th International Parallel Processing Symposium / 9th 
Symposium on Parallel and Distributed Processing, Apr. 1998, pp. 542-548 

10. I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. 
Morgan Kaufmann Publishers, 1998 

11. P. Holenarsipur, V. Yarmolenko, J. Duato, D. K. Panda, and P. Sadayappan. 
Characterization and Enhancement of Static Mapping Heuristics for Heterogeneous 
Systems. Technical report OSU-CISRC-2/00-TR07, Department of Computer and 
Information Science, Ohio State University, 2000 

12. O. H. Ibarra and C. E. Kim. Heuristic Algorithms for Scheduling Independent Tasks on 
Nonidentical Processors. Journal of the ACM, Vol. 24, No. 1, Jan. 1977, pp. 280-289 

13. M. Kafil and I. Ahmad. Optimal Task Assignment in Heterogeneous Computing Systems. 
6th IEEE Heterogeneous Computing Workshop (HCW'97), Apr. 1997, pp. 135-146 

14. W. Leinberger, G. Karypis, and V. Kumar. Multi-Capacity Bin Packing Algorithms with 
Applications to Job Scheduling under Multiple Constraints. Proceedings of the 1999 
International Conference on Parallel Processing, Aug. 1999, pp. 404-413 

15. M. Maheswaran, Shoukat Ali, H. J. Siegel, D. Hensgen, and R. F. Freund. Dynamic 
Matching and Scheduling of a Class of Independent Tasks onto Heterogeneous Computing 
Systems. 8th IEEE Heterogeneous Computing Workshop (HCW'99), Apr. 1999, pp. 30-44 

16. A. Radulescu and A. van Gemund. FLB: Fast Load Balancing for Distributed-Memory 
Machines. Proceedings of the 1999 International Conference on Parallel Processing, Aug. 
1999, pp. 534-542 

17. J. Schopf, F. Berman. Performance Prediction in Production Environments. Proceedings of 
the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and 
Distributed Processing, Apr. 1998, pp. 647-654 

18. G. Shao, R. Wolski and F. Berman. Performance Effects of Scheduling Strategies for 
Master/Slave Distributed Applications. UCSD Technical Report No. CS98-598 

19. H. Singh and A. Youssef. Mapping and Scheduling Heterogeneous Task Graphs using 
Genetic Algorithms. 5th IEEE Heterogeneous Computing Workshop (HCW'96), Apr. 1996 

20. H. Topcuoglu, S. Hariri, and Min-You Wu. Task Scheduling Algorithms for Heterogeneous 
Processors. 8th IEEE Heterogeneous Computing Workshop (HCW'99), Apr. 1999, pp. 3-14 

21. L. Wang, H. J. Siegel and V. P. Roychowdhury. A Genetic-Algorithm-Based Approach for 
Task Matching and Scheduling in Heterogeneous Computing Environments. 5th IEEE 
Heterogeneous Computing Workshop (HCW'96), Apr. 1996 

22. L. A. Yan, J. K. Antonio. Estimating the Execution Time Distribution for a Task Graph in a 
Heterogeneous Computing System. 6th IEEE Heterogeneous Computing Workshop 
(HCW'97), Apr. 1997, pp. 172-184 




[Figure: two rows of four panels, one panel per case (8 machines with 16, 32, 64 and 
128 tasks). Upper panels: utilization plotted against efficacy for MinMin, MaxMin, 
Greedy, UDA, UDA-Coarse, UDA-CFT and UDA-Fine. Lower panels: one entry per 
heuristic along the horizontal axis.] 

Fig. 1. Efficacy, utilization and makespan for mapping heuristics 






Optimal Segmented Scan 
and Simulation of Reconfigurable Architectures 
on Fixed Connection Networks* 



Alan A. Bertossi and Alessandro Mei 

Department of Mathematics, University of Trento, Italy, 
{bertossi ,mei}@science . unitn. it 



Abstract. Given n elements $x_0, \ldots, x_{n-1}$, and given n bits $b_0, \ldots, b_{n-1}$, 
with at least one zero, the segmented scan problem consists in finding 
the prefixes $s_i = x_i \otimes b_i s_{(i-1) \bmod n}$, $i = 0, \ldots, n-1$, where $\otimes$ is an 
associative binary operation that can be computed in constant time by 
a processor. This paper presents: 

(i) an $O(\log B)$ time optimal algorithm for the segmented scan problem 
on a $(2n-1)$-node toroidal X-tree, where B is the maximum distance 
of two successive zeroes in $b_0, \ldots, b_{n-1}$; 

(ii) a novel definition of locally normal algorithms for trees and meshes 
of trees; 

(iii) a constant slow-down, optimal, and locally normal simulation 
algorithm for a class of reconfigurable architectures on the mesh of 
toroidal X-trees, if the log-time delay model is assumed; 

(iv) a constant slow-down optimal simulation of locally normal 
algorithms for meshes of toroidal X-trees on the hypercube. 



1 Introduction 

Given n elements $x_0, \ldots, x_{n-1}$, and an associative operation $\otimes$, a scan (also 
known as parallel prefix) operation consists in computing the prefixes 
$s_i = x_0 \otimes x_1 \otimes \cdots \otimes x_i$, $i = 0, \ldots, n-1$. 
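
For reference, a sequential version of the scan operation can be written in a few lines of Python using `itertools.accumulate` from the standard library. This is only an illustrative baseline (the wrapper name `scan` is an assumption); it says nothing about the parallel algorithms discussed in this paper.

```python
from itertools import accumulate
import operator


def scan(xs, op=operator.add):
    """Return the prefixes s_i = x_0 op x_1 op ... op x_i."""
    return list(accumulate(xs, op))


SUMS = scan([3, 1, 4, 1, 5])                 # prefix sums
MAXES = scan([3, 1, 4, 1, 5], max)           # running maximum
WORDS = scan(["a", "b", "c"], operator.add)  # associative, non-commutative op
```

Any associative operation can be passed as `op`; string concatenation is included to show that commutativity is not required.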

Scans are probably the most important parallel operations. They are so commonly 
used in computation that research has been done in order to endow sequential 
systems with scans as primitive parallel operations. A lot of research 
has been done to find algorithms for this problem on most of the known parallel 
models of computation, from PRAMs to trees and hypercubes. For example, 
$O(\log n)$ time is needed to perform a scan operation on an $O(n/\log n)$-processor 
EREW PRAM, on an $O(n/\log n)$-node tree, and on an $O(n/\log n)$-node 
hypercube. All these results are clearly time and work optimal, within constant 
factors. 

An important particular case of this problem arises when the scan operation 
is to be done separately on segments of the input vector $x_0, \ldots, x_{n-1}$. The 

* This work has been supported by grants from the University of Trento and the 
Provincia Autonoma di Trento. 

segments are indicated by giving n bits $b_0, \ldots, b_{n-1}$, and the segmented scan 
problem thus consists in computing the prefixes: 

\[ s_0 = x_0, \qquad s_i = x_i \otimes b_i s_{i-1}, \]

where $i = 1, \ldots, n-1$. Of course, this problem can still be solved in $O(\log n)$ time 
on the above parallel models, as this is just a particular scan operation. However, 
if there are enough $b_i$'s set to zero in the segmenting sequence, the dependencies of 
the output prefixes are so localized that sub-logarithmic algorithms could exist, 
and an $\Omega(\log n)$ time lower bound does not hold any more. Indeed, one expects 
the above operation to require just $O(\log B)$ time to be completed, where B 
is the size of the longest segment, which is clearly a lower bound on the time 
needed to perform a segmented scan on the above models. In this paper, it is 
shown that this is an upper bound too, for X-trees and hypercubes. 
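
The following sequential Python sketch illustrates the segmented scan and the quantity B. It is a reference implementation for checking small cases only, not the parallel algorithm of this paper. The function names are assumptions, a zero bit is taken to start a new segment, and the accumulated prefix is placed to the left of the new element so that the result agrees with the plain scan when all bits after the first are one.

```python
import operator


def segmented_scan(xs, bits, op=operator.add):
    """s_0 = x_0; s_i = x_i if b_i == 0, otherwise s_{i-1} op x_i."""
    out = []
    for i, (x, b) in enumerate(zip(xs, bits)):
        if i == 0 or b == 0:
            out.append(x)               # a zero bit starts a new segment
        else:
            out.append(op(out[-1], x))  # extend the current segment
    return out


def longest_segment(bits):
    """Size B of the longest segment (non-circular reading)."""
    longest = current = 0
    for i, b in enumerate(bits):
        current = 1 if (i == 0 or b == 0) else current + 1
        longest = max(longest, current)
    return longest


XS = [3, 1, 4, 1, 5, 9]
BITS = [0, 1, 1, 0, 1, 0]
PREFIXES = segmented_scan(XS, BITS)
B = longest_segment(BITS)
```

In this example the bits split the input into the segments (3, 1, 4), (1, 5) and (9), so the prefixes restart at positions 3 and 5 and B is 3.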

Strangely enough, this nice property of segmented scans has never been 
remarked in the literature. The main problem is probably that it is not clear what 
the improvement is. Indeed, $O(\log n)$ time is still needed to know B, and thus 
to predict the time needed by the operation. However, in some situations, it is 
possible to know in advance a sub-linear upper limit on B, and thus that the 
problem can be solved in sub-logarithmic time. 

An important example of this situation arises when algorithms written for 
reconfigurable architectures are to be simulated on fixed-connection networks 
like trees and hypercubes. Indeed, the peculiar feature of a reconfigurable ar- 
chitecture of segmenting its buses in order to perform different broadcasting 
operations on different segments of buses, is a case in which, under proper as- 
sumptions, B can be exactly predicted. Exploiting the above result, it is thus 
possible to develop optimal simulation algorithms for a class of reconfigurable 
architectures. 

This paper is organized as follows: Section 2 reviews the models of par- 
allel computation used; Section 3 describes a time optimal algorithm for the 
segmented scan problem on the toroidal X-tree; as applications, Section 4 and 
Section 5 show how the mesh of toroidal X-trees and the hypercube optimally
simulate a class of reconfigurable architectures. 

2 Fixed-Connection and Reconfigurable Networks 

A fixed-connection network is a parallel model of computation consisting of a
network of synchronous processors. The topology of a fixed-connection network
is usually described by means of an undirected graph Q = (V, E), where nodes
stand for processors, and edges for links between processors.

Each processor of the network has a local control program and a local storage,
and is allowed to perform a constant number of word operations. The size of a
word is $O(\log n)$, where $n$ is the size (i.e. number of nodes) of the network. We
will also limit the complexity of the operations, by allowing only elementary
arithmetic. The edges of the graph Q = (V, E) describe the topology. If $uv \in E$,
the processors represented by nodes $u$ and $v$ are connected by a physical link,
which can be used to carry communications between the two processors.

Time is divided into steps by a global clock. At each step, each processor: 

(1) reads messages coming from its neighbors, 

(2) performs the computation described by its local control, 

(3) sends a constant number of words to its neighbors. 

A reconfigurable network is a network of processors operating synchronously.
Unlike in a fixed-connection network, each node can dynamically connect
and disconnect its edges in various patterns by configuring a local switch.

Each switch has a number of I/O ports and each port is directly connected 
to at most one edge. While the edges outside the switch are fixed, the internal 
connections between the I/O ports of each switch can be locally configured in 
$O(1)$ time by the processor itself into any partition of the ports. In this way,
during the execution of an algorithm, the edges of the network are dynamically 
partitioned into edge-disjoint subgraphs. Every such subgraph forms a sub-bus, 
and allows the processors of the sub-bus to broadcast a message to all the other 
processors sharing the same sub-bus. 

Several variants of the model described above have been defined, depending 
on the kind of allowed partitions and of the sub-bus arbitration. 

— Linear reconfigurable network (LRN). Only partitions made of pairs and 
singletons are allowed (see Figure 1(a)). In this way, each sub-bus has the
form of either a path or a cycle. 

— General reconfigurable network (RN). Any partition of the ports is allowed 
(see also Figure 1(b)). Thus, the possible configurations are all the partitions
of the graph into edge-disjoint connected subgraphs.
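
The two variants can be checked by brute force on a single four port switch. The sketch below, whose helper names are hypothetical, enumerates all the partitions of the ports N, S, E, W (the RN model) and then keeps those made of pairs and singletons only (the LRN model).

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # put `first` into one of the existing blocks ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or into a new block of its own
        yield [[first]] + partition


PORTS = ["N", "S", "E", "W"]

# RN model: any partition of the ports is a legal switch configuration.
rn_configs = list(set_partitions(PORTS))

# LRN model: only blocks of size one or two are allowed.
lrn_configs = [p for p in rn_configs if all(len(block) <= 2 for block in p)]
```

The counts obtained this way (10 configurations for the LRN model, 5 additional ones for the RN model) agree with the two panels of Figure 1.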

Moreover, two main models are considered with respect to the kind of sub-bus
arbitration: the exclusive-write model, where each port of the network must
be reached by at most one message at any step of computation; and the
common-write model, where several messages are allowed to reach the same port during
the same step, provided that all of them contain the same value, which in turn
consists of a single bit only.

Finally, two models have been considered in the literature for delay along
buses. According to the so-called unit-time delay model, the time required by
a broadcast operation to be completed is constant, regardless of the length of
the sub-bus. A log-time delay model has also been considered, which assumes
that the time required by a broadcast operation is logarithmic in the length (i.e.
number of switches traversed) of the sub-bus.

A reconfigurable mesh is a reconfigurable network whose topology is a
2-dimensional grid. An important variant of this model is the basic reconfigurable
mesh, where the connection patterns are limited in such a way that sub-buses
can run only along rows and columns, thus no bends are allowed. More formally,
the only patterns allowed are those shown in Figure 2. Clearly, all the variants
of the reconfigurable network model apply to the (basic) reconfigurable mesh,
too.

Fig. 1. All the configurations allowed on a four port switch.
(a) LRN model: {N,S,E,W}, {NE,S,W}, {N,SE,W}, {N,SW,E}, {NW,S,E},
{NS,E,W}, {N,S,EW}, {NE,SW}, {NW,SE}, {NS,EW}.
(b) Additional configurations allowed in the RN model: {NEW,S}, {NSE,W},
{N,SEW}, {NSW,E}, {NSEW}.

Fig. 2. All the configurations allowed on a basic reconfigurable mesh:
{N,S,E,W}, {NS,E,W}, {N,S,EW}, {NS,EW}.

In this paper, the log-time delay basic reconfigurable mesh is assumed.
Moreover, we will endow the grid with toroidal links, which connect the first and last
node on each row and each column (see Figure 3). This is useful in order to
consider other well-known reconfigurable architectures as particular cases of the
toroidal basic reconfigurable mesh. Namely, we refer to the HV reconfigurable
mesh, the mesh with separable buses, and the polymorphic-torus network.
All the results for the basic reconfigurable mesh presented in this paper
directly extend to all of them.




Fig. 3. A 4 x 4 basic reconfigurable mesh with toroidal links. 
3 Optimal Segmented Scan Operations on Toroidal X-Trees

Let $A$ be a set whose elements can be coded in a word, and let $\otimes$ be an
associative binary operation over $A$, such that $\otimes$ can be computed in $O(1)$ time by a
processor.

Given an $n$-tuple, $n = 2^k$, of elements $x_0, \ldots, x_{n-1} \in A$, and given an $n$-tuple
of bits $b_0, \ldots, b_{n-1}$, with at least one zero, the segmented scan problem consists
in finding the prefixes

$$s_i = x_i \otimes b_i s_{(i-1) \bmod n}$$

where $i = 0, \ldots, n-1$.

The bits $b_0, \ldots, b_{n-1}$ define a partition of $x_0, \ldots, x_{n-1}$ in the following way.
Let $h > 0$ be the number of zeroes in $b_0, \ldots, b_{n-1}$, and let $i_j$, $j = 1, \ldots, h$,
be the index of the $j$-th zero in the sequence. Naturally, a partition of the
indices $0, \ldots, n-1$ is given by $B_1, \ldots, B_h$, where $B_j = \{i_j, i_j + 1, \ldots, i_{j+1} - 1\}$,
$j = 1, \ldots, h-1$, and $B_h = \{i_h, i_h + 1, \ldots, 0, 1, \ldots, i_1 - 1\}$. Clearly, the segmented
scan problem consists in computing a scan operation independently inside each
set of the partition.
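
The partition induced by the bits, including the last segment that wraps around from index $n-1$ to index 0, can be made concrete with a short sequential sketch. The function names are hypothetical and the code is meant only as a reference for small inputs, not as the parallel algorithm discussed later.

```python
def cyclic_segments(bits):
    """Return the segments B_1, ..., B_h as lists of indices.
    Each segment starts at a zero bit and runs (cyclically) up to,
    but not including, the next zero bit."""
    n = len(bits)
    zeros = [i for i, b in enumerate(bits) if b == 0]
    if not zeros:
        raise ValueError("at least one bit must be zero")
    segments = []
    for j, start in enumerate(zeros):
        nxt = zeros[(j + 1) % len(zeros)]   # start of the following segment
        length = (nxt - start) % n or n     # a single zero gives one segment of size n
        segments.append([(start + t) % n for t in range(length)])
    return segments


def cyclic_segmented_scan(xs, bits, op):
    """Scan independently inside each segment: s_i = op(x_i, s_{i-1})."""
    out = [None] * len(xs)
    for seg in cyclic_segments(bits):
        acc = xs[seg[0]]
        out[seg[0]] = acc
        for i in seg[1:]:
            acc = op(xs[i], acc)
            out[i] = acc
    return out


bits = [1, 0, 1, 1, 0, 1, 1, 1]
segs = cyclic_segments(bits)
sums = cyclic_segmented_scan([1, 2, 3, 4, 5, 6, 7, 8], bits, lambda a, b: a + b)
```

In this example the zeros are at indices 1 and 4, so the second segment wraps around and also contains index 0.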

This problem has a straightforward lower bound of $\Omega(\log \max_j |B_j|)$ time
on a fixed-connection network. Unfortunately, the natural solution of letting a
classical complete binary tree solve the problem, by providing $x_0, \ldots, x_{n-1}$ to
its leaves, is not optimal. Indeed, assuming $n \ge 4$ to be a multiple of 4, if the
partition is such that $B_j = \{2j - 1, 2j \bmod n\}$, $j = 1, \ldots, n/2$, then $B_{n/4}$ has
one element in the left subtree of the root, and the other element in the right
one. The two leaves are thus $\Omega(\log n)$ apart, and this is a lower bound on the time
needed by the tree to solve the problem. In this case $\log \max_j |B_j| = 1$; hence, a
binary tree cannot match the $\Omega(\log \max_j |B_j|)$ time lower bound.

In order to optimally solve this problem, a slightly more complex architecture 
is needed. 



3.1 The Toroidal X-Tree 

A toroidal X-tree (TX-tree) $T = (V, E)$ of size $m = 2^{k+1} - 1$ is a fixed-connection
network composed of $m$ nodes $v_1, \ldots, v_m \in V$, connected by bidirectional links
in a tree-like fashion. The nodes are partitioned into $k + 1$ levels. The $l$-th level,
$l = 0, \ldots, k$, is formed by nodes $v_{2^l}, \ldots, v_{2^{l+1}-1}$. The nodes on level $l$, except
level 0, are connected in such a way that $v_{2^l+i}$ has a bidirectional link to $v_{2^l+i+1}$,
$i = 0, \ldots, 2^l - 2$, and $v_{2^{l+1}-1}$ to $v_{2^l}$. Moreover, each node $v_i$ on level $l$,
$i = 2^l, \ldots, 2^{l+1} - 1$, is linked to its parent $v_{\lfloor i/2 \rfloor}$ in level $l - 1$ (see Figure 4).
Intuitively, $T$ is a complete binary tree of $m$ nodes which has been extended by
adding links in such a way as to form a ring on each level of the tree.

For the sake of clarity, it is useful to define a new operation $\oplus 1$ over the set
$\{1, \ldots, m\}$ of indices of the nodes of a toroidal X-tree. This operation will be






A. A. Bertossi and A. Mei 




Fig. 4. A toroidal X-tree of 15 nodes. 

used to find the index of the "next" node in the ring of nodes at any level of the
TX-tree. More formally,

$$x \oplus 1 = \begin{cases} 2^l & \text{if } x = 2^{l+1} - 1, \text{ for some } l > 0, \\ x + 1 & \text{otherwise.} \end{cases}$$

Similarly, a $\ominus 1$ operation can be defined which produces the index of the
"previous" node in the ring. We also use a $\oplus c$ ($\ominus c$) operation, where $c$ is a positive
constant, obtained by iterating $\oplus 1$ ($\ominus 1$) $c$ times.
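
The ring operations and the link structure of the TX-tree can be written down in a few lines. The following sketch uses hypothetical function names and makes one assumption that the text does not state: the root (index 1, level 0) is treated as its own successor and predecessor.

```python
def ring_next(x):
    """x (+) 1: index of the next node on the ring of the level of node x."""
    # x is the last node of its level, 2**(l+1) - 1, exactly when x + 1
    # is a power of two; in that case wrap around to the first node 2**l.
    if (x + 1) & x == 0:
        return (x + 1) >> 1
    return x + 1


def ring_prev(x):
    """x (-) 1: index of the previous node on the same ring."""
    if x & (x - 1) == 0:        # x = 2**l is the first node of its level
        return 2 * x - 1
    return x - 1


def ring_shift(x, c):
    """x (+) c, obtained by iterating ring_next c times."""
    for _ in range(c):
        x = ring_next(x)
    return x


def tx_tree_edges(k):
    """Undirected links of a TX-tree with 2**(k+1) - 1 nodes."""
    edges = set()
    for v in range(2, 2 ** (k + 1)):
        edges.add(frozenset((v, v // 2)))         # link to the parent
        edges.add(frozenset((v, ring_next(v))))   # link along the level ring
    return edges
```

For the 15-node TX-tree of Figure 4 (k = 3) this gives 14 tree links plus 1 + 4 + 8 ring links, the ring of level 1 consisting of a single link between its two nodes.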

3.2 An Optimal Algorithm for the Segmented Scan Problem 

We start by giving a “weak” definition of the segmented scan problem, which 
will be useful in the second part of this paper. 

Definition 1. Given an $n$-tuple, $n = 2^k$, of elements $x_0, \ldots, x_{n-1} \in A$, and
given $h$ segments $B_j = \{\alpha_j, (\alpha_j + 1) \bmod n, \ldots, \omega_j\} \subseteq \{0, \ldots, n-1\}$ such that:

(1) $\alpha_j < \alpha_{j+1}$, for all $j \in \{1, \ldots, h-1\}$;

(2) $|B_j \cap B_{j+1}| \le 1$, for all $j \in \{1, \ldots, h-1\}$; and $|B_h \cap B_1| \le 1$;

the overlapping segmented scan problem consists in computing a scan operation
within each segment $B_j$, $j = 1, \ldots, h$.

It is easy to see that the segmented scan problem is a particular case of the
overlapping version, where the segments $B_j$, $j = 1, \ldots, h$, form a partition of
$\{0, \ldots, n-1\}$.

A toroidal X-tree $T = (V, E)$ of size $m = 2n - 1$ has $n = 2^k$ leaves and can be
used to optimally solve an overlapping segmented scan problem of size $n$. Assume
that $x_0, \ldots, x_{n-1}$ are stored in the leaves of the TX-tree in such a way that the
node of index $i + 2^k$ stores $x_i$, $i = 0, \ldots, n-1$. Also assume that it stores a two
bit register $c_i = c_{i,1} c_{i,2}$ such that $c_{i,1}$ is equal to zero if and only if there exists $j$
such that $\alpha_j = i$, and $c_{i,2}$ is equal to zero if and only if there exists $j$ such that
$\omega_j = i$. Note that a segmented scan problem with parameters $b_0, \ldots, b_{n-1}$ can
be converted into an overlapping one by setting $c_i = b_i b_{(i+1) \bmod n}$.

For each segment $B_j$, $j = 1, \ldots, h$, define a sub-TX-tree $ST_j = (V_j, E_j)$ in
the following way. $I^j_0$ is the set of indices of the nodes storing an $x_i$ such
that $i \in B_j$. Recursively define $I^j_{r+1}$, $r \ge 0$, to be the set of indices of the nodes
having both sons in $I^j_r$. Let $r_j$ be the minimum index such that $|I^j_{r_j}| < 2$.
Then, let $V_j$, the set of nodes of $ST_j$, be equal to $I^j_0 \cup \cdots \cup I^j_{r_j}$, and let $E_j$
contain all the edges in $E$ connecting two nodes of $V_j$ (see Figure 5 for an example).
It is useful to state a few simple facts on the structure of $ST_j$.




Fig. 5. In this example, $B_1 = \{1, \ldots, 4\}$, $B_2 = \{4, \ldots, 7\}$, $B_3 = \{7, \ldots, 10\}$,
$B_4 = \{10, \ldots, 15, 0, 1\}$. Nodes in $ST_2$ and $ST_4$ are dark, while nodes in $ST_1$ and $ST_3$ are light.
Edges and nodes which do not belong to any $ST_j$ are dashed.



Fact 1 For all $j = 1, \ldots, h$, $ST_j$ is connected.
Fact 2 For all $j = 1, \ldots, h$, $r_j$ is $O(\log |B_j|)$.



Fact 3 For all j = 1, . . . , h, and for all nodes Vj £ STj \ at least one of 
f|;i/ 2 J' ^Lbsi)/ 2 !' an d is STj; 

$ST_j$ is well-suited to compute the prefix operation inside $B_j$. Indeed, the
topology of $ST_j$ is close to the topology of a tree, and its low diameter allows
optimal time to be achieved.

Lemma 1. For all $j = 1, \ldots, h$, the diameter of $ST_j$ is $O(\log |B_j|)$.

At this point, we can show how a sub-TX-tree $ST_j$ can solve the problem in
the segment $B_j$. Initially, each leaf $v_{i+2^k}$, $i = 0, \ldots, 2^k - 1$, stores $x_i$ in a register
$u_{i+2^k}$, and the two bits $c_i$ in a two bit register $d_{i+2^k}$.

Lemma 2. The scan operation on the values $x_i$, $i \in B_j$, can be performed in
$O(\log |B_j|)$ time on the sub-TX-tree $ST_j$.

Theorem 4. A toroidal X-tree of size $2n - 1$ can time-optimally solve the
overlapping segmented scan problem in $O(\max_j \log |B_j|)$ time.

Corollary 1. A toroidal X-tree of size $2n - 1$ can time-optimally solve the
segmented scan problem in $O(\max_j \log |B_j|)$ time.
4 The Mesh of Toroidal X-Trees



An $r \times c$ mesh of toroidal X-trees, with $r$ and $c$ powers of two, is a fixed-connection
network composed of $r$ row and $c$ column TX-trees of size $2c - 1$ and $2r - 1$,
respectively, sharing their leaves. Specifically, the leaf $v_{j+c}$, $j = 0, \ldots, c-1$, of
the $i$-th row TX-tree, $i = 0, \ldots, r-1$, is also the leaf $v_{i+r}$ of the $j$-th column
TX-tree. Let this node be denoted by $P_{i,j}$. By construction, the nodes $P_{i,j}$,
$i = 0, \ldots, r-1$ and $j = 0, \ldots, c-1$, form a grid of processors. As can be
easily seen, an $r \times c$ mesh of toroidal X-trees is very similar to an $r \times c$ mesh of
trees, with toroidal X-trees in the place of classical trees.

4.1 Optimal Simulation of Basic Reconfigurable Meshes 

With a few adaptations, the broadcast operation along each sub-bus of a basic 
reconfigurable mesh becomes a segmented scan operation. Let the associative 
operation used, be defined as follows: 



where $x$ and $y$ are two messages sent by two processors, $\perp$ is a symbol indicating
that the processor is not transmitting any message, and error is a symbol for
detecting multiple writes on the bus.

Fact 5 The operation defined above is associative.
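
The displayed definition of the operation did not survive in this copy of the text. The sketch below is therefore only an assumption: it shows one natural operation that is consistent with the description above ($\perp$ acts as the identity, and two simultaneous writers produce error), and it checks associativity by brute force over a small set of values. The names `NO_MSG`, `ERROR` and `bus_op` are hypothetical.

```python
from itertools import product

NO_MSG = None      # plays the role of the "not transmitting" symbol
ERROR = "error"    # plays the role of the multiple-write marker


def bus_op(x, y):
    """Assumed bus operation: the empty symbol is an identity element,
    and combining two actual messages (or an earlier error) gives ERROR."""
    if x is NO_MSG:
        return y
    if y is NO_MSG:
        return x
    return ERROR


# Brute-force associativity check on a small domain (messages must not
# themselves be equal to the ERROR marker).
values = [NO_MSG, "a", "b", ERROR]
is_associative = all(
    bus_op(bus_op(x, y), z) == bus_op(x, bus_op(y, z))
    for x, y, z in product(values, repeat=3)
)
```

Under this assumption, a scan over a sub-bus returns the single message written on it, the empty symbol if nobody writes, and error if two or more processors write.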

Consider a row of an $r \times c$ basic reconfigurable mesh, with $r = 2^{k_1}$ and
$c = 2^{k_2}$. At the first step of the computation, each processor opens or closes the
switch on the row bus. Let $b_i$ be either zero, if the switch of the $i$-th processor is
open, or one, if it is closed. A broadcast operation along the sub-buses created
can be simulated by executing two instances of an overlapping segmented scan
with $x_0, \ldots, x_{n-1}$ being the messages, and $c_i = b_i b_i$, $i = 0, \ldots, n-1$, the
bits defining the segments $B_1, \ldots, B_h$. This can be done in the following way.
First compute the prefixes $s^l_0, \ldots, s^l_{n-1}$ resulting from the scan of
$x_0, x_1, \ldots, x_{n-1}$. Then, compute the prefixes $s^r_0, \ldots, s^r_{n-1}$ resulting from the
scan of $x_{n-1}, x_{n-2}, \ldots, x_0$. Clearly, both results can be obtained
by Theorem 4, the latter being a mirror image of the former. Finally, the
result of the broadcast operation at the $i$-th processor is easily computed as:

$$\begin{cases} \perp & \text{if } s^l_i = \perp \text{ and } s^r_i = \perp, \\ s^l_i & \text{if } s^l_i \ne \perp \text{ and } s^r_i = \perp, \\ s^r_i & \text{if } s^l_i = \perp \text{ and } s^r_i \ne \perp, \\ \text{error} & \text{otherwise.} \end{cases}$$



Theorem 6. Given an algorithm $A$ for an $r \times c$ basic reconfigurable mesh,
$r = 2^{k_1}$ and $c = 2^{k_2}$, an $r \times c$ mesh of toroidal X-trees can optimally simulate $A$ with
constant slow-down, using the log-time delay model.
5 The Hypercube

A $k$-dimensional hypercube is a fixed-connection network consisting of $n = 2^k$
nodes. Each node has a unique index in $\{0, \ldots, n-1\}$, and two nodes are linked
with an edge if and only if the binary representations of their indices differ in
precisely one bit.
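
In terms of node indices, the adjacency rule amounts to flipping a single bit. A minimal sketch follows; the function names are hypothetical.

```python
def hypercube_neighbors(node, k):
    """Indices of the k neighbors of `node` in a k-dimensional hypercube:
    flip each of the k bits of the index in turn."""
    return [node ^ (1 << bit) for bit in range(k)]


def are_adjacent(u, v):
    """True iff the binary representations of u and v differ in exactly one bit."""
    diff = u ^ v
    return diff != 0 and diff & (diff - 1) == 0
```

For instance, in the 3-dimensional hypercube node 5 (binary 101) is adjacent to nodes 4, 7 and 1.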

5.1 Optimal Simulation of Basic Reconfigurable Meshes 

The problem of simulating a basic reconfigurable mesh on a hypercube is solved 
by giving a simulation scheme for meshes of TX-trees on hypercubes. Indeed, 
combining this result with Theorem 6, the simulation of basic reconfigurable
meshes on hypercubes is obtained as a byproduct. 

It is well-known that a $k$-dimensional hypercube easily simulates with constant
slowdown normal algorithms for a $(2^{k+1} - 1)$-node complete binary tree.
However, the problems here are that the simulation algorithm of Theorem 6 is
not normal, and that a toroidal X-tree has more wires than a simple tree. While
the latter is solved as a simple application of Gray codes, the former involves a
weaker and novel definition of normality.

Definition 2. An algorithm $A$ for a tree is said to be normal if:

(a) it uses only nodes of one level at each step of computation, and 

(b) consecutive levels are used in consecutive steps. 

As said above, the simulation algorithm of Theorem 6 is not normal, because
nodes at different levels are used in different parts of the TX-tree, but it is 
intuitively almost normal. The following definition formalizes this intuition. 

Definition 3. An algorithm A for a tree is said to be locally normal if for each 
root-leaf path: 

(a) it uses only one node at each step of computation, and 

(b) consecutive nodes are used in consecutive steps. 

The above definition directly applies to a number of tree-based architectures 
like X-trees, toroidal X-trees, meshes of trees, and meshes of toroidal X-trees. 
It is immediate to see that a normal algorithm is also locally normal, while the 
opposite is not true in general. In particular, this is the case of the simulation 
algorithm of Theorem 6.

Lemma 3. The simulation algorithm of Theorem 6 is locally normal.

By using the reflected Gray code, it is possible to find a mapping from the 
nodes of a TX-tree into a hypercube which permits an optimal simulation of 
locally normal algorithms. 

Let $G^{(w)}_s$, $w \ge 1$, $s = 0, \ldots, 2^w - 1$, be the $s$-th element of the $w$-bit binary
reflected Gray code. We map the $i$-th node of the $l$-th level of the $(2^{k+1} - 1)$-node
TX-tree into the node of index $G^{(l)}_i 0^{k-l}$ of the $2^k$-node hypercube, where
$0^x$ stands for a sequence of $x$ zeroes. Note that, in this way, adjacent nodes
of the TX-tree are mapped to either the same node or adjacent nodes of the
hypercube. Indeed, $G^{(l)}_i 0^{k-l}$ differs in one bit only from both $G^{(l)}_{(i+1) \bmod 2^l} 0^{k-l}$
and $G^{(l)}_{(i-1) \bmod 2^l} 0^{k-l}$, and, if $l > 0$, $G^{(l)}_i 0^{k-l}$ differs in at most one bit from
$G^{(l-1)}_{\lfloor i/2 \rfloor} 0^{k-l+1}$.

Theorem 7. Given a locally normal algorithm $A$ for a toroidal X-tree of size
$m = 2^{k+1} - 1$, a $k$-dimensional hypercube can simulate $A$ with constant
slow-down.

Consequently, an $n$-node hypercube can also solve the segmented scan
problem in $O(\max_j \log |B_j|)$ time.
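
The adjacency property of the Gray code mapping can be verified exhaustively for small $k$. The sketch below assumes the usual formula $s \oplus \lfloor s/2 \rfloor$ (bitwise exclusive or) for the binary reflected Gray code, numbers the nodes of each level from 0, and maps the root to index 0; the function names are hypothetical.

```python
def gray(s):
    """s-th element of the binary reflected Gray code."""
    return s ^ (s >> 1)


def embed(level, i, k):
    """Hypercube index of the i-th node (i = 0, ..., 2**level - 1) of `level`:
    the `level`-bit Gray code of i followed by k - level zeroes."""
    return gray(i) << (k - level)


def at_most_one_bit(u, v):
    """True iff u and v are the same node or adjacent hypercube nodes."""
    diff = u ^ v
    return diff & (diff - 1) == 0


def mapping_preserves_adjacency(k):
    """Check every ring link and every parent link of the TX-tree."""
    for level in range(1, k + 1):
        width = 2 ** level
        for i in range(width):
            here = embed(level, i, k)
            ring = embed(level, (i + 1) % width, k)
            parent = embed(level - 1, i // 2, k)
            if not (at_most_one_bit(here, ring) and at_most_one_bit(here, parent)):
                return False
    return True
```

The parent link works because dropping the last bit of the Gray code of $i$ gives the Gray code of $\lfloor i/2 \rfloor$, so a node and its parent can differ only in that one bit.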

In order to extend the previous result to simulate an $r \times c$ mesh of toroidal
X-trees, with $r = 2^{k_1}$ and $c = 2^{k_2}$, map the $i$-th node of the $l$-th level of the
row $r'$, $r' = 0, \ldots, r-1$, into the node of index $G^{(k_1)}_{r'} G^{(l)}_i 0^{k_2-l}$ of the $2^{k_1+k_2}$-node
hypercube. Similarly, map the $i$-th node of the $l$-th level of the column $c'$,
$c' = 0, \ldots, c-1$, into the node of index $G^{(l)}_i 0^{k_1-l} G^{(k_2)}_{c'}$ of the hypercube. Note
that, by using this mapping, a single node of the hypercube may have to simulate
the action of two nodes of the mesh in the same step, even if the algorithm is
locally normal. Of course, this fact causes a slow-down of 2, at most, which is
still constant.

Corollary 2. Given a locally normal algorithm $A$ for an $r \times c$ mesh of toroidal
X-trees, with $r = 2^{k_1}$ and $c = 2^{k_2}$, a $(k_1 + k_2)$-dimensional hypercube can simulate
$A$ with constant slow-down.

Finally, by combining Corollary 2 and Theorem 6, the following theorem is
also proved.

Theorem 8. Given an algorithm $A$ for an $r \times c$ basic reconfigurable mesh, with
$r = 2^{k_1}$ and $c = 2^{k_2}$, a $(k_1 + k_2)$-dimensional hypercube can optimally simulate
$A$ with constant slow-down, using the log-time delay model.
Abstract. A significant shortcoming of causal message ordering systems
is their inefficiency because of false causality. False causality is the result 
of the inability of the “happens before” relation to model true causal 
relationships among events. The inefficiency of causal message ordering 
algorithms takes the form of additional delays in message delivery and 
requirements for large message buffers. This paper gives a lightweight 
causal message ordering algorithm based on a modified “happens before” 
relation. This lightweight algorithm greatly reduces the inefficiencies that 
traditional causal message ordering algorithms suffer from, by reducing 
the problem of false causality. 



1 Introduction 

In a distributed system, causal message ordering is valuable to the application
programmer because it reduces the complexity of application logic and retains
much of the concurrency of a FIFO communication system. Causal message
ordering is defined using the "happens before" relation, also known as the causality
relation and denoted $\rightarrow$, on the events in the system execution. For two
events e1 and e2, e1 $\rightarrow$ e2 iff one of the following conditions is true: (i) e1 and
e2 occur on the same process and e1 occurs before e2, (ii) e1 is the emission of
a message and e2 is the reception of that message, or (iii) there exists an event
e3 such that e1 $\rightarrow$ e3 and e3 $\rightarrow$ e2.
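
The three conditions translate directly into a small computation of the relation as a transitive closure. In the sketch below the representation of events as (process, index) pairs, the function name, and the three-process example are all hypothetical choices made for illustration.

```python
def happens_before(events_per_process, messages):
    """Return the set of pairs (a, b) such that a -> b.

    events_per_process: {process id: number of events on that process}
    messages: list of (emission event, reception event) pairs,
              an event being a (process id, index) tuple.
    """
    events = [(p, i) for p, count in events_per_process.items() for i in range(count)]
    relation = set(messages)                              # condition (ii)
    for p, count in events_per_process.items():           # condition (i)
        relation |= {((p, i), (p, i + 1)) for i in range(count - 1)}
    # condition (iii): transitive closure, with the intermediate event
    # in the outermost loop (Warshall's algorithm)
    for mid in events:
        for a in events:
            if (a, mid) in relation:
                for b in events:
                    if (mid, b) in relation:
                        relation.add((a, b))
    return relation


# Hypothetical run: P1 sends M1 to P3 and then M2 to P2; after receiving M2,
# P2 sends M3 to P3; M3 reaches P3 before M1 does.
hb = happens_before(
    {1: 2, 2: 2, 3: 2},
    [((1, 0), (3, 1)),   # M1
     ((1, 1), (2, 0)),   # M2
     ((2, 1), (3, 0))],  # M3
)
```

In this run the emission of M1, event (1, 0), happens before the emission of M3, event (2, 1), through the chain formed by M2.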

Let Send(M) denote the event of a process handing over the message M to
the communication subsystem. Let Deliver(M) denote the event of M being
delivered to a process after it has been received by its local communication
subsystem. The system respects causal message ordering (CO) iff for any two
messages $M_1$ and $M_2$ sent to the same destination, $(Send(M_1) \rightarrow Send(M_2))
\Rightarrow (Deliver(M_1) \rightarrow Deliver(M_2))$. In Figure 1, causal message ordering is
respected if message $M_1$ is delivered to process $P_3$ before message $M_3$.

When a message arrives out of order with respect to the above definition, 
a causal message ordering system buffers it and delivers it only after all the 
messages that should be seen before it in causal order, have arrived and have 
been delivered. 

Causal message ordering is very useful in several areas such as managing
replicated database updates, consistency enforcement in distributed shared memory,
enforcing fair distributed mutual exclusion, efficient snapshot recording, global
predicate evaluation, and data delivery in real-time multimedia systems. It has
been implemented in systems such as Isis, Transis, Horus, Delta-4, Psync,
and Amoeba. The causal message ordering problem and various algorithms
to provide such an ordering have been studied in several works,
which also provide a survey of this area.

Fig. 1. Causal message ordering.

A causal message ordering abstraction and its implementation were given
by Raynal, Schiper and Toueg (RST). For a system with $n$ processes,
the RST algorithm requires each process to maintain an $n \times n$ matrix, the
SENT matrix. $SENT[i, j]$ is the process's best knowledge of the number of
messages sent by process $P_i$ to process $P_j$. A process also maintains an array
DELIV of size $n$, where $DELIV[k]$ is the number of messages sent by
process $P_k$ that have already been delivered locally. Every message carries,
piggybacked on it, the SENT matrix of the sender process. A process $P_j$ that
receives message $M$ with the matrix $SP$ piggybacked on it is delivered $M$ only
if, $\forall i$, $DELIV[i] \ge SP[i, j]$. $P_j$ then updates its local SENT matrix $SENT_j$
as: $\forall k \forall l$, $SENT_j[k, l] = \max(SENT_j[k, l], SP[k, l])$. Several optimizations that
exploit topology, communication pattern, hardware broadcast, and underlying
synchronous support are surveyed in the literature, which also identifies and formulates
the necessary and sufficient conditions on the information for causal message
ordering and their optimal implementation.
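
The delivery test and the matrix update described above can be sketched as follows. This is not a full implementation of the RST algorithm: the bookkeeping performed when a message is sent and right after it is delivered (incrementing the counters) is not detailed in the text and is left out. The function names and the three-process example are hypothetical.

```python
def can_deliver(deliv, sp, j):
    """Delivery test at process P_j: every message that, according to the
    piggybacked matrix `sp`, was sent to P_j must already be delivered."""
    return all(deliv[i] >= sp[i][j] for i in range(len(deliv)))


def merge_sent(sent, sp):
    """Component-wise maximum of the local SENT matrix and the piggybacked one."""
    n = len(sent)
    return [[max(sent[k][l], sp[k][l]) for l in range(n)] for k in range(n)]


# Hypothetical situation with processes P0, P1, P2. The matrix piggybacked
# on M records that P0 had already sent one message to P1 and one to P2.
sp = [[0, 1, 1],
      [0, 0, 0],
      [0, 0, 0]]

must_wait = not can_deliver([0, 0, 0], sp, j=2)    # P2 has delivered nothing yet
ready = can_deliver([1, 0, 0], sp, j=2)            # P2 has delivered P0's message
merged = merge_sent([[0, 0, 0], [0, 0, 1], [0, 0, 0]], sp)
```

A message that fails the test is kept in a buffer and tested again after each later delivery.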

Cheriton and Skeen pointed out several drawbacks of the causal message
ordering paradigm and these were further discussed in later work. The most
significant was that every implementation of a CO system has to deal with
false causality. False causality is the insistence of the system on imposing a
particular causality ordering of events even though the application semantics do not
require such an ordering.




Fig. 2. False causality. No semantic causal dependence between $Send(M_1)$ and
$Send(M_3)$.
In Figure 2, $Send(M_1) \rightarrow Send(M_3)$. Assume that $Send(M_1)$ and $Send(M_3)$
are not causally related according to the application semantics. Now suppose $M_3$
reaches $P_3$ before $M_1$ does. In a causal message ordering system, it is required
that $Deliver(M_1) \rightarrow Deliver(M_3)$ and hence $M_3$ is buffered till $M_1$ is
received and delivered. $Send(M_1)$ and $Send(M_3)$ are not semantically related and,
semantically, the behavior of $P_3$ does not depend on the order in which it receives
and is delivered $M_1$ and $M_3$. Hence, buffering of $M_3$ is really unnecessary. This
buffering is wasteful of system resources and the withholding of message delivery
unnecessarily delays the system execution.

This paper addresses the topic of reduction of false causality in causal mes- 
sage ordering systems. We propose an algorithm in which the incidence of false 
causality is much lower than in a conventional causal message ordering system 
that is based purely on the “happens before” relationship. 

The notion of false causality arising from the "happens before" relation itself
has been identified in several other contexts earlier, even before Cheriton and
Skeen pointed out its drawbacks in the performance of causal message ordering
algorithms. When Lamport defined the "happens before" relation $\rightarrow$, he
pointed out that "e1 $\rightarrow$ e2 means that it is possible for event e1 to causally
affect event e2" but e1 and e2 need not necessarily have any semantic
dependency. Fidge proposed a clock system to track true causality more accurately in
a system with multithreaded processes. However, this scheme and its variants
are very expensive. Tarafdar and Garg pointed out the drawbacks of false
causality in detecting predicates in distributed computations.

Section 2 gives the system model and a framework to design relations that
have varying degrees of false causality. Section 3 presents a practical and easy
to implement partial order relation based on the framework of Section 2 that
reduces many of the false causal relationships modeled by the $\rightarrow$ relation.
Based on this new relation, the section then presents a lightweight algorithm
that reduces false causality in causal message ordering. Section 4 concludes.

2 System Model 

A distributed system is modeled as a finite set of $n$ processes communicating
with each other by asynchronous message passing over reliable logical channels.
There is no shared memory and no common clock in the system. We assume
that channels deliver messages in FIFO order. A process execution is modeled
as a set of events, the time of occurrence of each of which is distinct. An event
at a process could be a message send event, a message delivery event, or an
internal event. A message can be multicast at a send event, in which case it is
sent to multiple processes. A distributed computation or execution is the set
of all events ordered by the "happens before" relation $\rightarrow$, also defined in
Section 1. Define e1 $\preceq$ e2 as (e1 $\rightarrow$ e2) $\vee$ (e1 = e2). The $j$-th event on process
$P_i$ is denoted as $e_i^j$. Each process $P_i$ has a default initialization event $e_i^0$. The
set of all events $E$ forms a partial order $(E, \rightarrow)$. The causal past of an event $e$
is denoted $EC(e) = \{e' \mid e' \rightarrow e\}$. The causal past of an event $e$, projected on






P. Gambhire and A.D. Kshemkalyani 



$P_j$, is denoted $EC_j(e_i^k) = \{e' \mid e' \rightarrow e_i^k$ and $e'$ occurs at $P_j\}$. For any event $e_i^k$, define a vector event
count $ECV(e_i^k)$ of size $n$ such that $ECV(e_i^k)[j] = |EC_j(e_i^k)| - 1$. Thus, $ECV(e_i^k)[j]$
gives the number of computation events at $P_j$ in the causal past of $e_i^k$.

False causality is defined based on the true causality partial order relation
$\stackrel{s}{\longrightarrow}$ on events; the $\stackrel{s}{\longrightarrow}$ relation is the analog of the $\rightarrow$ relation that accounts
only for semantically required causality, and is defined similarly to earlier work.

Definition 1. Given two events e1 and e2, e2 semantically depends on e1,
denoted as e1 $\stackrel{s}{\longrightarrow}$ e2, iff the action taken at e2 depends on the outcome of e1.



We assume that a message delivery event is always semantically dependent on
the corresponding message send event. Furthermore, if e1 and e2 are on different
processes and e1 $\stackrel{s}{\longrightarrow}$ e2, then there exists a message $M$ such that Send(M) and
Delivery(M) lie on the chain of semantic dependencies leading from e1 to e2.
With this interpretation, $\stackrel{s}{\longrightarrow}$ is the transitive closure of a "local semantically
depends on" relation and the "happens before" imposed by message send and
corresponding delivery events.

By substituting $\stackrel{s}{\longrightarrow}$ for $\rightarrow$ in the traditional definition of causal message
ordering, the resulting definition will be termed "semantic causal ordering". In
contrast, the traditional causal message ordering will be termed the "happens
before causal ordering". We will also refer to the causal message ordering problem
as simply the causal ordering problem.

If the only ordering imposed on events is to respect the semantic causality
relation defined above, then there is no performance degradation due to false
causality. In other words, if messages $M_1$ and $M_2$ are sent to the same
destination, then $M_1$ need not be delivered before $M_2$ if $Send(M_1) \stackrel{s}{\not\longrightarrow} Send(M_2)$,
but $M_1$ will be delivered before $M_2$ if $Send(M_1) \stackrel{s}{\longrightarrow} Send(M_2)$. This model
defines causality among events based on the application semantics. An
implementation of this model requires that the semantic causal dependencies of each
event be available. Existing programming language paradigms do not permit
such specifications, and neither does an API for such a specification seem
practical. It is possible for a compiler to extract such information by analyzing data
dependencies. Alternatively, analogous to Fidge and Tarafdar-Garg, we
can assume that this information is computable using known techniques.
In fact, as we will show, the causal ordering algorithm we propose requires only
that the following information is available: for any event, information about the
most recent local event on which the event has a semantic dependency.

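Definition 2. Given two events e1 and e2, e1 weakly causes e2, denoted as
e1 $\stackrel{w}{\longrightarrow}$ e2, iff e1 $\stackrel{s}{\not\longrightarrow}$ e2 $\wedge$ e1 $\rightarrow$ e2.

For a computation $(E, \rightarrow)$ and a complete specification of the true causality
$(E, \stackrel{s}{\longrightarrow})$ in the computation, the amount of false causality is the size of the
$\stackrel{w}{\longrightarrow}$ relation, which is $\rightarrow(E \times E) \setminus \stackrel{s}{\longrightarrow}(E \times E)$.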
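
When both relations are available as explicit sets of ordered pairs, the definition reduces to a set difference. The sketch below uses a hypothetical three-event example; in practice neither relation is available in this explicit form.

```python
def weakly_causes(happens_before, semantic):
    """Pairs (e1, e2) with e1 -> e2 that are not semantic dependencies."""
    return happens_before - semantic


# Hypothetical computation with events a, b, c: a -> b -> c (and hence a -> c),
# where only the dependency of c on b is required by the application.
hb = {("a", "b"), ("b", "c"), ("a", "c")}
sem = {("b", "c")}

weak = weakly_causes(hb, sem)
amount_of_false_causality = len(weak)
```

Here two of the three ordered pairs are false causality: a ordering that enforces them only delays delivery without being needed by the application.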

Ideally, it is desired to implement the $\stackrel{s}{\longrightarrow}$ relation, and have $\stackrel{w}{\longrightarrow}$ be the
empty relation on $E \times E$. The difficulties in having the programmer specify the
$\stackrel{s}{\longrightarrow}$ relation are given above. Though a compiler or an alternate mechanism can
identify the exact set of all local and nonlocal events that semantically precede
the current event to implement true causality, the overhead of tracking such a set
of events as the computation progresses is nontrivial. Therefore, our objective
is to approximate $\stackrel{s}{\longrightarrow}$ at a low cost to make an implementation practical, and
minimize the size of the $\stackrel{w}{\longrightarrow}$ relation on $E \times E$. To this end, we introduce the
vector MCV to track the latest event at each process such that if the
happens-before causal order for a message presently sent is enforced only with respect to
messages sent in the computation prefix identified by such events, then semantic
causal order is not violated. The vector MCV naturally identifies a computation
prefix denoted MC. Thus, for a message sent at any event $e$, the vector $MCV(e)$
ensures that if happens-before causal ordering is guaranteed with respect to
messages sent in $MC(e)$, then semantic causal ordering is guaranteed with respect
to all messages sent in the causal past of the event $e$.

Definition 3. For any event $e$, define vector $MCV(e)$ and set $MC(e)$ to have
the following properties.

- The maximum causality vector $MCV(e)$ is a vector of length $n$ with the
following properties:

• (Containment:) $\forall j,\ MCV(e)[j] \le ECV(e)[j]$, and

• (Semantic dependency satisfaction:) $e_j^l \xrightarrow{s} e \Rightarrow l \le MCV(e)[j]$

- $MC(e) = \{ e_j^l \mid l \le MCV(e)[j] \}$, i.e., $MC(e)$ is the computation prefix
such that the latest event of this prefix at each process Pj is $MCV(e)[j]$.



We now state Proposition 1, which holds because no event later than
MCV(e)[j] at any process Pj semantically precedes the event e.

Proposition 1. For any event e, if every message sent by each Pj among its 
first MCV(e)[j] events is delivered in happens-before causal order with respect 
to any messages sent at event e, then every message sent in ECV(e) is delivered 
in semantic causal order with respect to any messages sent at event e. 

Observe that in general, there are multiple values of MCV that will satisfy
Definition 3. Any formulation of MCV consistent with Definition 3 can be used
by a causal ordering implementation to reduce false causality. Clearly, different 
formulations of MCV reduce the false causality to various degrees and can be 
implemented with varying degrees of ease. Two desirable properties of a good 
formulation of MCV are: 

— It should eliminate as much of the false causality as possible. 

— It should be implementable with low overhead. 

Happens-before causal ordering of a message sent at event e needs to be
enforced only with respect to messages sent in the computation prefix MC(e).
To implement this causal ordering, for each event $e_i^k$, process Pi needs to track
the number of messages sent by each process Pj up to event $e_j^{MCV(e_i^k)[j]}$ to every
other process. As observed by process Pi at $e_i^k$, the count of all such messages
sent by each Pj up to $e_j^{MCV(e_i^k)[j]}$ to every other process Pl can be tracked by a
matrix SENT[1...n, 1...n], where SENT[j, l] is the number of messages sent
by Pj up to $e_j^{MCV(e_i^k)[j]}$ to Pl. Analogously, we track the count of all messages
sent by each Pj up to $e_j^{ECV(e_i^k)[j]}$ to each other process Pl, by using the matrix
SENT_ECV[1...n, 1...n].

For a traditional causal ordering system, MCV(e) = ECV(e) and this im- 
plementation exhibits the negative effects caused by the maximum amount of 
false causality. Note that here SENT = SENT_ECV and the matrix SENT
as we have defined then degenerates to the matrix SENT as defined in [ HTIl 

3 Algorithm 

3.1 Preliminaries 

We propose the following formulation of the vector MCV(e), which is easy to
implement and gives a good lightweight solution to the false causality problem. The
definition uses the max function on vectors, which gives the component-wise 
maximum of the vectors. 

Definition 4. 1. Initially, $\forall i,\ MCV(e_i^0) = [0, \ldots, 0]$.

2. For an internal event or a send event $e_i^k$,

$MCV(e_i^k) = ECV(e_i^k)$ if $\exists\, e_q^p \mid MCV(e_i^{k-1})[q] < p \le ECV(e_i^k)[q] \wedge e_q^p \xrightarrow{s} e_i^k$,
$MCV(e_i^k) = MCV(e_i^{k-1})$ otherwise

3. For a delivery event $e_i^k$ of a message sent at $e_m^r$,

$MCV(e_i^k) = \max(MCV(e_i^{k-1}), MCV(e_m^r))$

Observe that the only way $e_q^p \xrightarrow{s} e_i^k$, where $q \ne i$, is if $\exists\, e_i^{k'}, e_q^{p'}$ such that
$e_i^{k'} \xrightarrow{s} e_i^k$, $e_i^{k'} = Deliver(M)$, $e_q^{p'} = Send(M)$, and $e_q^p \xrightarrow{s} e_q^{p'}$. Also observe
that the MCV of an event is the max of the ECVs of one or more events in
its causal past, and therefore, analogous to EC, if Deliver(M) belongs to the
MC of some event, then Send(M) also belongs to the MC of that event. Based
on the above two observations, it is possible to simplify the condition test in
Definition 4 as follows.

Definition 5. 1. Initially, $\forall i,\ MCV(e_i^0) = [0, \ldots, 0]$.

2. For an internal event or a send event $e_i^k$,

$MCV(e_i^k) = ECV(e_i^k)$ if $\exists\, e_i^p \mid MCV(e_i^{k-1})[i] < p \le k \wedge e_i^p \xrightarrow{s} e_i^k$,
$MCV(e_i^k) = MCV(e_i^{k-1})$ otherwise

3. For a delivery event $e_i^k$ of a message sent at $e_m^r$,

$MCV(e_i^k) = \max(MCV(e_i^{k-1}), MCV(e_m^r))$
Informally, the above formulation of MCV identifies the following events
at each process. For any event $e_i^k$, (I) $MCV(e_i^k)[j]$, $j \ne i$, identifies the latest
event $e_j$ such that some event at Pi occurring causally after $e_j$ and in $EC(e_i^k)$
depends semantically on $e_j$; (II) if $e_i^k$ depends semantically on some event at Pi
that occurs after $MCV(e_i^{k-1})[i]$, then $MCV(e_i^k)[i]$ identifies $e_i^k$ at Pi; otherwise it
identifies $MCV(e_i^{k-1})[i]$ at Pi. Causal ordering of a message sent at $e_i^k$ needs to
be enforced only with respect to messages sent in the computation prefix up to
these identified events.

Lemma 1 shows that this formulation of MCV is an instantiation of Definition 3.
Specifically, it states that each event that semantically happens before $e_i^k$
belongs to the computation prefix up to the events indicated by $MCV(e_i^k)$.

Lemma 1. Definition 5 satisfies the "Containment" property and the "Semantic
Dependency Satisfaction" property of MCV described in Definition 3, i.e.,
$(\forall j,\ MCV(e)[j] \le ECV(e)[j])$ and $e_j^l \xrightarrow{s} e_i^k \Rightarrow l \le MCV(e_i^k)[j]$.

The formulation of MCV in Definition 5 is easy to implement because we
can observe the following property.

Property 1. From Definition 5 it follows that

1. At a send event $e_i^k$, determining $MCV(e_i^k)$ requires $MCV(e_i^{k-1})$ and
identifying the most recent local event on which there is a semantic dependency
of $e_i^k$. This information can be stored locally at a process.

2. At a delivery event $e_i^k$ of a message sent at event $e_m^r$, determining $MCV(e_i^k)[j]$
requires $MCV(e_i^{k-1})[j]$ and $MCV(e_m^r)[j]$. $MCV(e_m^r)[j]$ can be piggybacked
on the message sent at $e_m^r$.
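
The two cases of Property 1 can be sketched as follows. This is a minimal illustration, not the authors' code: vectors are assumed to be Python lists indexed by process, `ecv` is assumed to be supplied by the caller, and `latest_dep` is a hypothetical name for the index of the most recent local event on which the new event semantically depends (0 if there is none).

```python
def mcv_at_local_or_send(prev_mcv, ecv, i, latest_dep):
    """MCV of an internal or send event at process i.

    MCV jumps to ECV only if the event depends on a local event that lies
    after the prefix already covered by the previous MCV.
    """
    if latest_dep > prev_mcv[i]:
        return list(ecv)
    return list(prev_mcv)


def mcv_at_delivery(prev_mcv, sender_mcv):
    """MCV of a delivery event: component-wise max with the piggybacked MCV."""
    return [max(a, b) for a, b in zip(prev_mcv, sender_mcv)]


# Hypothetical vectors for a three-process system.
prev = [2, 1, 0]
ecv = [5, 3, 2]
advanced = mcv_at_local_or_send(prev, ecv, 0, latest_dep=4)
unchanged = mcv_at_local_or_send(prev, ecv, 0, latest_dep=1)
merged = mcv_at_delivery(prev, [1, 4, 0])
```

In both functions each component of the result is at least the corresponding component of `prev_mcv`, provided `ecv` dominates `prev_mcv` as the Containment property requires.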

The following property based on Definition 5 implies that the entries in the
SENT matrix at a process are monotonically nondecreasing in the computation.

Property 2. $MCV(e_i^k)[j] \ge MCV(e_i^{k-1})[j]$, i.e., MCV(e)[j] is monotonically
nondecreasing at any process.



3.2 Algorithm to Reduce False Causality 

Definition 5 of MCV is realized by the algorithm presented in Figure 3. This
algorithm is based on the causal ordering abstraction of Raynal, Schiper and 
Toueg ca because it provides a convenient base to express the proposed ideas. 
The proposed ideas can be superimposed on more efficient causal ordering algo- 
rithms such as those proposed and surveyed in B33- 

Recall that in E), each send event and each delivery event updates the 
SENT matrix to reflect the maximum available knowledge about the number of 
messages sent from each process to every other process in the causal past. This 
implies that any message M has to be delivered in causal order with respect 
to all the messages sent in the causal past of the event Send(M ), even though 
there may be no semantic dependency between M and these messages. This is 
a source of false causality which can be minimized by the lightweight algorithm
proposed here. 

Each process Pi maintains the data structures DELIVi, SENT_CONCi and
SENT_PREVi, described below.

- DELIVi: array[1..n] of integer.
DELIVi[j] is Pi's knowledge of the number of messages from process Pj that
have been delivered to Pi thus far. It is initialized to zeros.

- SENT_PREVi: array[1..n, 1..n] of integer.
SENT_PREVi[j, l] at $e_i^k$ is Pi's knowledge of the number of messages sent
from Pj to Pl up to the event $e_j^{MCV(e_i^k)[j]}$. It is initialized to zeros.

- SENT_CONCi: array[1..n, 1..n] of integer.
SENT_CONCi[j, l] at $e_i^k$ is Pi's knowledge of the number of messages sent
from Pj to Pl after the event $e_j^{MCV(e_i^k)[j]}$. It is initialized to zeros.



Recall that the n x n matrix SENT_ECV at $e_i^k$, defined in Section 2, gave the
count of all messages sent by each Pj up to $e_j^{ECV(e_i^k)[j]}$ to each other process Pl, in
SENT_ECV[j, l]. The two matrices SENT_PREV and SENT_CONC at any
process have the invariant property that SENT_PREVi[j, l] +
SENT_CONCi[j, l] = SENT_ECVi[j, l], as will be shown in Lemma 2. The row
SENT_PREVi[j, ·] reflects the row SENT_ECVi[j, ·] up to the event $e_j^{MCV(e_i^k)[j]}$,
whereas the row SENT_CONCi[j, ·] reflects the row SENT_ECVi[j, ·] after
that event. The challenge is to maintain the SENT_PREV and SENT_CONC
matrices at a low cost so as to retain the above property.

The causal ordering algorithm that minimizes false causality is given in
Figure 3. At a message send event, steps (E1)-(E7) are executed atomically.
Step (E1) determines based on the sender's semantics if the message being
sent, and implicitly all future messages sent in the system, should be delivered
in causal order with respect to all messages sent so far, i.e., whether
$MCV(e_i^k) = ECV(e_i^k)$. Specifically, the test should_Semantically_Precede uses
two inputs: $MCV(e_i^{k-1})$ and the latest local event on which there is a local
dependency (Property 1). These two inputs are used to check for Definition 5, i.e.,
to determine whether there exists an event $e_i^p$ such that
$e_i^{MCV(e_i^{k-1})[i]} \rightarrow e_i^p \wedge e_i^p \xrightarrow{s} e_i^k$, in which case $MCV(e_i^k)$ will be greater than
$MCV(e_i^{k-1})$. If $MCV(e_i^k)$ is greater than $MCV(e_i^{k-1})$, then the condition
should_Semantically_Precede becomes true and $MCV(e_i^k)$ should be set to
$ECV(e_i^k)$. In this case, steps (E2)-(E5)
update the matrices SENT_PREV and SENT_CONC to reflect this; otherwise
SENT_PREV and SENT_CONC are left unchanged. Step (E6) sends the message
with the two matrices piggybacked on it. Step (E7) updates SENT_CONC
to reflect the message(s) just sent.

When a message M, along with the sender's SENT_PREV and
SENT_CONC matrices SP and SC piggybacked on it, is received by a process,
M can be delivered only if the number of messages delivered locally so far
is greater than or equal to the number of messages sent to this process as per SP
(step (R1)). Steps (R2)-(R8) are executed atomically. The message gets delivered
in step (R2). Steps (R3)-(R4) update the data structures SC and DELIV
to reflect that the message was sent and has now been delivered, respectively.
Steps (R5)-(R8) update the data structures SENT_CONC and SENT_PREV.
These steps ensure that SENT_PREV reflects the maximum knowledge about
the messages that were sent by events in the MC of the current event, while
SENT_CONC reflects the maximum knowledge about the messages that were
not sent by events in the MC of the current event.

Data structures at Pi

D1. DELIVi: array[1..n] of integer
D2. SENT_PREVi: array[1..n, 1..n] of integer
D3. SENT_CONCi: array[1..n, 1..n] of integer

Emission of message M from Pi to Pj

E1. if (should_Semantically_Precede) then
E2.     for k = 1 to n do
E3.         for l = 1 to n do
E4.             SENT_PREVi[k, l] = SENT_PREVi[k, l] + SENT_CONCi[k, l]
E5.             SENT_CONCi[k, l] = 0
E6. Send(M, SENT_PREVi, SENT_CONCi)
E7. SENT_CONCi[i, j] = SENT_CONCi[i, j] + 1

Reception of (M, SP_M, SC_M) at Pj from Pi

R1. Wait until (∀k, SP_M[k, j] ≤ DELIVj[k])
R2. Deliver M
R3. SC_M[i, j] = SC_M[i, j] + 1
R4. DELIVj[i] = DELIVj[i] + 1
R5. for k = 1 to n do
R6.     for l = 1 to n do
R7.         SENT_CONCj[k, l] = max(SENT_CONCj[k, l] + SENT_PREVj[k, l],
                SP_M[k, l] + SC_M[k, l]) - max(SENT_PREVj[k, l], SP_M[k, l])
R8.         SENT_PREVj[k, l] = max(SENT_PREVj[k, l], SP_M[k, l])

Fig. 3. Causal message ordering algorithm to minimize false causality.
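
The send and receive steps can be exercised with the following sketch. It is a minimal single-threaded simulation under stated assumptions, not the authors' implementation: processes are numbered from 0, the outcome of the should_Semantically_Precede test is passed in as a boolean by the caller, and the `Process` class, its `pending` buffer and the message tuple layout are hypothetical.

```python
import copy


class Process:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.deliv = [0] * n                          # D1
        self.sent_prev = [[0] * n for _ in range(n)]  # D2
        self.sent_conc = [[0] * n for _ in range(n)]  # D3
        self.pending = []    # received but not yet deliverable
        self.delivered = []  # payloads in delivery order

    def send(self, dest, payload, should_semantically_precede):
        if should_semantically_precede:               # E1
            for k in range(self.n):                   # E2-E5
                for l in range(self.n):
                    self.sent_prev[k][l] += self.sent_conc[k][l]
                    self.sent_conc[k][l] = 0
        msg = (self.pid, payload,
               copy.deepcopy(self.sent_prev),
               copy.deepcopy(self.sent_conc))         # E6
        self.sent_conc[self.pid][dest] += 1           # E7
        return msg

    def _deliverable(self, msg):
        sp = msg[2]                                   # R1
        return all(sp[k][self.pid] <= self.deliv[k] for k in range(self.n))

    def _deliver(self, msg):
        src, payload, sp, sc = msg
        self.delivered.append(payload)                # R2
        sc[src][self.pid] += 1                        # R3
        self.deliv[src] += 1                          # R4
        for k in range(self.n):                       # R5-R8
            for l in range(self.n):
                total = max(self.sent_conc[k][l] + self.sent_prev[k][l],
                            sp[k][l] + sc[k][l])
                prev = max(self.sent_prev[k][l], sp[k][l])
                self.sent_conc[k][l] = total - prev   # R7 (uses old SENT_PREV)
                self.sent_prev[k][l] = prev           # R8

    def receive(self, msg):
        self.pending.append(msg)
        progress = True
        while progress:  # retry buffered messages after each delivery
            progress = False
            for m in list(self.pending):
                if self._deliverable(m):
                    self.pending.remove(m)
                    self._deliver(m)
                    progress = True


def run(flag_for_m3):
    p0, p1, p2 = (Process(i, 3) for i in range(3))
    m1 = p0.send(2, "m1", False)
    m2 = p0.send(1, "m2", False)
    p1.receive(m2)
    m3 = p1.send(2, "m3", flag_for_m3)
    p2.receive(m3)   # m3 arrives before m1
    early = list(p2.delivered)
    p2.receive(m1)
    return early, p2.delivered


no_dependency = run(False)
with_dependency = run(True)
```

When the send of m3 has no semantic dependency on earlier events, P2 delivers m3 without waiting for m1, although m1 happens before m3. When the dependency flag is set, m3 is buffered until m1 has been delivered, as in traditional causal ordering.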

3.3 Correctness Proof 

We state some lemmas and the main theorem that prove the correctness of the 
algorithm. See the full paper for details 0. 

Lemma 2 gives the invariant among SENT_PREV, SENT_CONC and
SENT_ECV at any event.

Lemma 2. SENT_PREVi + SENT_CONCi = SENT_ECV

Lemma 3 states that the SENT_PREV matrix reflects exactly all the messages
sent by various processes up to MCV(e). This includes all the send events
that semantically precede all local events up to the current event.

Lemma 3. The messages sent by Pj until $e_j^{MCV(e_i^k)[j]}$ correspond exactly to the
messages represented by SENT_PREVi[j, ·] at $e_i^k$.
Lemma 4 states that messages represented by the matrix SENT_CONC are
the messages sent concurrently in terms of semantic dependency with respect to
the current event. These are the messages with respect to which no "happens-before"
causal ordering needs to be enforced in order to meet the semantic causal
ordering requirements.

Lemma 4. The messages represented by SENT_CONCi[j, ·] at $e_i^k$ correspond
exactly to the messages sent in the left-open right-closed duration
$(MCV(e_i^k)[j],\ ECV(e_i^k)[j]]$.

Theorem 1. The algorithm given in Figure 3 implements causal message ordering
with respect to the relation $\xrightarrow{s}$.



3.4 Algorithm Analysis 

With the proposed approach, any message M multicast at event $e_i^k$ should be
delivered in causal order with respect to all the messages represented by
SENT_PREV, i.e., sent in the computation prefix $MC(e_i^k)$. Messages represented by
SENT_CONC are those with respect to which no false causality is imposed
(follows from Lemma 4 and Proposition 1). The amount of false causality in
enforcing causal delivery of M is the size of $\xrightarrow{w}(MC(e_i^k) \times MC(e_i^k))$, in contrast
to the traditional causal ordering where the amount of false causality is the
size of $\xrightarrow{w}(EC(e_i^k) \times EC(e_i^k))$. While $\xrightarrow{w}(MC(e_i^k) \times MC(e_i^k))$ may still be large
(although smaller than that for the traditional approach), indicating that much
false causality may still exist, this is not so on close analysis. With the proposed
approach, false causality is potentially imposed only with respect to some of the
messages sent up to MCV. Each $MCV(e_i^k)[j]$ will usually be much less than
$ECV(e_i^k)[j]$. Hence, false causality is potentially imposed only with respect to
some of the messages sent in the more distant past. In practice, message delivery 
times tend to have an exponential distribution. Hence, messages sent in the 
distant past up to the events indicated by MCV(e^) would have been delivered 
with high probability and the present message could most likely be delivered as 
soon as it arrives and without any buffering. Only when some message
sent in the distant past has not been delivered, and a false causality exists on
such a message, will the proposed algorithm unnecessarily delay the present
message and require some buffering. We expect that such a case will have a low
probability of occurrence, and when it occurs, the extra delay incurred by the 
imposed false causality will be small. 

The computational complexity of the proposed algorithm is the same as that
of the RST algorithm. The $O(n^2)$ extra computation in steps (R5)-(R7) is of the
same order of magnitude as in the RST algorithm. The $O(n^2)$ extra computation
in steps (E1)-(E5) is of the same order of magnitude as in the RST algorithm.
In terms of space complexity, the RST algorithm requires $n^2 \times m$ bits of storage
space and message overhead, where m is the size of the message counter, whereas
the proposed algorithm requires $2n^2 \times m$ bits of storage space and message
overhead, which are comparable.
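
To make the space comparison concrete, the sketch below evaluates both expressions for one configuration. The values n = 8 processes and m = 32-bit counters are assumptions chosen for illustration only; they do not come from the paper.

```python
def piggyback_bits(n, m, matrices):
    """Bits needed for `matrices` n-by-n matrices of m-bit counters."""
    return matrices * n * n * m


n, m = 8, 32                            # hypothetical system size and counter width
rst_bits = piggyback_bits(n, m, 1)      # one SENT matrix
proposed_bits = piggyback_bits(n, m, 2) # SENT_PREV and SENT_CONC
```

The proposed algorithm carries exactly twice as many bits as the baseline for any n and m, which is a constant factor and leaves the order of growth unchanged.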
The only additional overhead is to implement Definition 5. By Property 1,
this requires the MCV vector of the previous local event and the identity of 
the latest local event on which there is a semantic dependency. The former 
information is already computed; the latter can be assumed to be available as 
in jnm or can be extracted from compiler data. 

4 Concluding Remarks 

False causality in causal message ordering reduces the performance of the sys- 
tem by unnecessarily delaying messages and requiring large buffers to hold the 
delayed messages. We presented an efficient algorithm for implementing causal 
ordering that eliminates much of the false causality. In particular, the algorithm 
eliminates the false causality in the near past of any message send event. Some 
false causality with respect to messages sent in the more distant past exists. It is 
expected that such false causality will have a minimal degradation on the perfor- 
mance of the ideal causal ordering implementation because messages sent in the 
more distant past will have been delivered with high probability and cause buffer- 
ing of the present message with low probability. The implementation presented 
here is lightweight and requires about the same order of magnitude overhead as 
the baseline RST algorithm, with minimal additional support. 
Abstract. Recently, many different protocols have been proposed for software
Distributed Shared Memory (DSM) that can provide a shared-memory programming
model for distributed memory hardware. Adaptive protocols attempt to allow the
system to choose between different protocols based on the access patterns it
observes in an application. This paper describes several problems that deteriorate
the performance of a hybrid protocol [6], an adaptive invalidate/update protocol.
To address these problems, this paper then presents a working-set based adaptive
invalidate/update protocol that uses a working-set model as the criterion for
determining whether to update or invalidate. The proposed protocol was implemented
in CVM [7], a software DSM system, and evaluated using eight nodes of an IBM SP2.
After experimenting with various working-set window sizes, it was confirmed that
the proposed protocol could track an access pattern better than the hybrid protocol,
and that with a very small window size the protocol was able to optimize the
overall performance.



1 Introduction 

Software distributed shared memory systems (DSM) provide programmers with the 
illusion of shared memory on top of message-passing hardware [2]. These systems 
provide a low-cost alternative for shared-memory computing, since they can be built 
using standard workstations and operating systems. Although many different proto- 
cols have been proposed for implementing a software DSM [1,4,9], the relative per- 
formance of these protocols is application-dependent: the memory access patterns of 
the application determine which protocol will produce a good performance. Accord- 
ingly, it would be interesting to build a system with multiple protocols, and allow the 
system to choose between the different protocols based on the access patterns it ob- 
serves in a particular application [1,6,9,10,12]. 

The lazy hybrid (LH) protocol [5,6], an adaptive invalidate/update protocol, is a
lazy protocol similar to lazy release consistency (LRC) [5] with an invalidate
protocol; however, instead of invalidating the modified pages, it updates some of the
pages at the time of synchronization. The decision on updating a page depends on
whether or not the target processor has accessed the page before.

This paper first describes three problems that deteriorate the performance of the
LH protocol: (i) the more diffs that are sent as updates, the longer the latency of the
synchronization, which can degrade the overall performance; (ii) the protocol
continues to apply the update protocol unnecessarily to pages that are accessed once
and never accessed later; and (iii) on lock synchronization, a releasing processor is
unable to determine the access pattern that an acquiring processor generated in the
latest interval.

To cope with these problems, a working-set model is proposed as the criterion for
determining whether to update or invalidate, and consequently, a Working-set based
Adaptive invalidate/update Protocol (WAP) is presented. In a conventional operating
system, the working-set model has been used for a demand paging subsystem that
permits greater flexibility in mapping the virtual address space of a process into the
physical memory of a machine [11]. For this model, Denning formalized the working-set
of a process, which is the set of pages that the process has referenced in its last A
memory references; the number A is called the window of the working-set.
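
The working-set definition above can be written down directly. The sketch below is illustrative only; the reference string and the window sizes are hypothetical.

```python
def working_set(references, window):
    """Distinct pages among the last `window` memory references."""
    if window <= 0:
        return set()
    return set(references[-window:])


refs = [1, 2, 3, 2, 4, 4, 5]     # hypothetical page reference string
small = working_set(refs, 3)     # looks at 4, 4, 5
large = working_set(refs, 5)     # looks at 3, 2, 4, 4, 5
```

A larger window can only add pages to the set, so the window size directly bounds how many pages are candidates for updates.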

WAP attempts to exploit spatial locality by assuming that a process tends to local- 
ize its references to the working-set and this set only changes slowly. Therefore, when 
sending the working-set with a lock acquire message, WAP allows the releasing proc- 
essor to know the latest access pattern of the acquiring processor and only updates the 
pages in the working-set. As a result, WAP attempts to constrain the number of diffs
sent via updates, thereby optimizing the tradeoff between the number of diffs sent via
updates and the latency of the synchronization. In addition, WAP can propagate the access
pattern of a processor earlier than LH, and therefore, has more chances to update. 



2 Background 

This paper focuses on page-based software DSM systems that use a multiple writer 
protocol and lazy release consistency (LRC) protocol, such as TreadMarks [4] and 
CVM [7]. In multiple writer protocols [1], write detection is performed by twinning 
and diffing. On the first write to a shared page, an identical copy of the page (a twin)
is made. The twin is then compared with the modified copy of the page to generate a
diff, a record of the modifications to the page. This diff is then returned in response to
a page fault request. 
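
Twinning and diffing can be sketched as follows. This is a simplified byte-level illustration, not the encoding used by TreadMarks or CVM: pages are modeled as `bytearray` objects and a diff as a list of (offset, new byte) pairs, both of which are assumptions made for the example.

```python
def make_twin(page):
    """Copy the page before the first write (the twin)."""
    return bytes(page)


def make_diff(twin, page):
    """Record every byte that differs between the twin and the modified page."""
    return [(off, page[off]) for off in range(len(page)) if page[off] != twin[off]]


def apply_diff(page, diff):
    """Bring a stale copy of the page up to date."""
    for off, value in diff:
        page[off] = value


page = bytearray(b"hello world")
twin = make_twin(page)          # taken on the first write fault
page[0] = ord("J")
page[6] = ord("W")
diff = make_diff(twin, page)

remote_copy = bytearray(b"hello world")
apply_diff(remote_copy, diff)
```

Only the modified bytes travel in the diff, which is what allows several writers to modify disjoint parts of the same page concurrently.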

The LRC [5] protocol, which is a release consistency [3] implementation, delays 
the propagation of shared memory modifications by processor p to processor q until q 
executes an acquire corresponding to a release by p. On an acquire operation, the last 
releaser can determine the set of write notices that the acquiring processor needs to 
receive, i.e. the set of notices that precede the current acquire operation in the partial 
order. A write notice is an indication that a page has been modified within a particular
interval; however, it does not contain the actual modifications. Upon receiving the
notices piggybacked on the lock grant message, the acquirer then causes the corre- 
sponding page to be invalidated. Access to an invalidated page causes a page fault. At 
this point, the faulting processor must then retrieve and apply to the page all the diffs 
that were created during the intervals that preceded the faulting interval in the partial 
order [3]. 
3 Lazy Hybrid Protocol and Its Limitations

3.1 Lazy Hybrid (LH) Protocol 

The LH [6], an adaptive invalidate/update protocol, is a lazy protocol similar to LRC
using an invalidate protocol; however, instead of invalidating the modified pages, it
updates some of the pages at the time of an acquire. LH attempts to exploit temporal
locality by assuming that any page accessed by a processor in the past will probably
be accessed by that processor again in the future. Therefore, all pages that are known
to have been accessed by the acquiring processor are updated.

Each processor uses copysets [5] to track the page accesses by other processors. At 
a synchronization point, the copyset is used to determine whether or not a given diff 
must be sent to a remote location. For each write notice to be sent, if the releasing 
processor has a diff corresponding to the write notice and the acquiring processor is in 
the local copyset of that page, that diff will be appended to the lock grant message. 
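
The copyset test at a lock release can be sketched as below. This is a simplified illustration under stated assumptions: a write notice is reduced to a page number, diffs are opaque values, and the function and parameter names are hypothetical.

```python
def lh_diffs_for_grant(write_notices, local_diffs, copysets, acquirer):
    """Diffs to append to the lock grant message under the LH rule.

    write_notices: page numbers the acquirer has not yet seen
    local_diffs:   page number -> diff held by the releasing processor
    copysets:      page number -> set of processors known to have accessed it
    """
    return [local_diffs[page]
            for page in write_notices
            if page in local_diffs and acquirer in copysets.get(page, set())]


write_notices = [1, 2, 3]
local_diffs = {1: "d1", 3: "d3"}
copysets = {1: {0, 2}, 2: {2}, 3: {0}}
appended = lh_diffs_for_grant(write_notices, local_diffs, copysets, acquirer=2)
```

Page 2 has no local diff and page 3 was never accessed by processor 2, so only the diff for page 1 is sent; the other pages are simply invalidated at the acquirer.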

On arrival at a barrier, each processor creates a list describing any local write
notices that may not have been seen by the other processors. A list for processor $p_j$ at
processor $p_i$ consists of processor $p_i$'s notion of all the local write notices that have
not been seen by $p_j$. $p_i$ sends an update message to $p_j$ containing all the diffs
corresponding to the write notices in this list.



3.2 Limitation of LH Protocol 

After executing several applications using the LH protocol and tracing their behavior,
several problems were identified that deteriorate the performance of the LH
protocol. First, under LH, an increase in the number of diffs sent as updates increases the
latency of the synchronization, because LH sends the diffs in a synchronization
message. Even when all the diffs sent via updates were used by the target processor,
the excessive increase in synchronization latency still tended to deteriorate
the overall performance.

Second, LH updates unnecessary diffs in some applications that have a migratory
access pattern. These applications access each page only once or twice during their entire
lifetime, and access the majority of all pages. In this case, LH continues to
unnecessarily apply the update protocol to the pages that are only accessed once and never
accessed later.

Third, on lock synchronization, a releasing processor’s copyset cannot reflect the 
access pattern that an acquiring processor has generated in the recent interval. Since 
LRC only requires that the acquiring processor knows (or receives) the events of the 
memory updates (or notices) that precede the current acquiring operation in the partial 
order, the releasing processor cannot know the latest access events of the acquiring
processor. As a result, LH has to rely on the access history of older intervals,
even though the latest events are the most important from the viewpoint of the
conventional operating system.

In light of these problems, a new protocol is proposed that uses a working-set 
model to track an access pattern efficiently, as described in the next section. 
4 Working-Set Based Adaptive Protocol

The proposed protocol, the Working-set based Adaptive Protocol (WAP), still updates 
some of the pages at the time of an acquire, like LH. However, WAP uses the work- 
ing-set of a processor to determine whether to update or invalidate the page, whereas 
LH uses the copysets of the pages. 

Each process maintains the working-set using a circular queue, where the size of
the queue, A, is the size of the working-set window. Whenever a read fault or a write
fault occurs, the protocol inserts that page number into the queue only if
the current page number is different from that at the queue's rear, that is, the latest
page number inserted. The reason for this is to manage the queue effectively against a
prevalent access pattern that generates a write fault to a page immediately after
generating a read fault. On sending an acquire message, the acquiring processor
appends the content of the queue to the message. Although the system would have to remove
any redundant page numbers to obtain an ideal working-set, the entire content of the
queue is appended in order to avoid the time overhead of doing so.
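
The queue described above can be sketched with a bounded deque. This is a minimal illustration, not the CVM implementation; the class and method names are hypothetical.

```python
from collections import deque


class WorkingSetQueue:
    def __init__(self, window):
        # A deque with maxlen drops the oldest entry when full,
        # which gives the behavior of a circular queue.
        self.queue = deque(maxlen=window)

    def record_fault(self, page):
        """Called on every read or write fault."""
        # Skip the page if it equals the rear entry, e.g. a write fault
        # that immediately follows a read fault on the same page.
        if not self.queue or self.queue[-1] != page:
            self.queue.append(page)

    def contents(self):
        """Queue content appended to an acquire message (duplicates kept)."""
        return list(self.queue)


ws = WorkingSetQueue(window=3)
for faulting_page in [7, 7, 8, 9, 7, 10]:
    ws.record_fault(faulting_page)
snapshot = ws.contents()
```

Only consecutive duplicates are filtered, so a page may still appear more than once in the queue; the receiver can treat the content as a set.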

The releasing processor creates a list describing the local write notices that may not 
have been seen by the acquiring processor. For each write notice in the list, if the re- 
leasing processor has the diff and the page of the notice is included in the working-set 
delivered just previously, the diff is appended to the lock grant message. 
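
The release-side decision then replaces the copyset test with a working-set membership test. The sketch below makes the same simplifying assumptions as before (a write notice is a page number, diffs are opaque values) and uses hypothetical names.

```python
def wap_diffs_for_grant(write_notices, local_diffs, acquirer_working_set):
    """Diffs to append to the lock grant message under the WAP rule."""
    pages_in_ws = set(acquirer_working_set)  # queue content may hold duplicates
    return [local_diffs[page]
            for page in write_notices
            if page in local_diffs and page in pages_in_ws]


write_notices = [1, 2, 3, 4]
local_diffs = {1: "d1", 2: "d2", 4: "d4"}
working_set_from_acquire = [9, 2, 4, 2]
appended = wap_diffs_for_grant(write_notices, local_diffs, working_set_from_acquire)
```

Because the working-set arrives with the acquire message itself, the decision reflects the acquirer's most recent faults, and the window size bounds the number of diffs that can be appended.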

On arrival at a barrier, each processor except for the barrier master creates a list
describing the local write notices that may not have been seen by the barrier master
processor, and then appends the diffs of a notice to a barrier arrival message only if
the notice was created by the processor itself and the notice's page is included in the
latest working-set sent earlier by the master (the former condition eliminates
redundant diff transmissions). Also, the processor's working-set is appended. After all
the barrier arrival messages arrive at the master, the master then sends barrier start
messages to all other processors in the same way as for a lock release.



5 Implementation and Evaluation 

5.1 Platform and Applications 

The two protocols, LH and WAP, as described in the previous sections, were
implemented in a CVM DSM system. The experimental environment consisted of eight
nodes on an IBM SP2 running AIX 4.1.4. Six applications were used in this paper:
TSP from the CVM package, Barnes, Water-Nsquared, Water-Spatial, and Ocean
from Splash-2 [8]. Table 1 summarizes the applications and their input sets. The
values in the table were obtained in the current study. The lock value was the number of
lock acquiring messages sent to remote processors, and not the total count of lock
acquires. The number of page requests shows indirectly how many pages were used
by the application.
Table 1. Application Suite

Fig. 1. Eight-processor normalized execution times for LI, LH, and WAP



5.2 Performance Evaluation 

A performance comparison was obtained by executing the application suite using WAP
with various window sizes, LH, and the lazy invalidate (LI) protocol. The LI protocol
is an LRC protocol using the invalidate protocol included in the CVM package. Fig. 1
shows the normalized execution times for the six applications for each of the three
protocols. In this figure, the execution times of WAP were obtained by selecting the
optimal window size, A, which resulted in the minimal execution time among all
window sizes. The value of A was 300 for TSP; all other values were 20.

Figs. 2 to 6 present the important elements of the performance
for each application. For WAP, the graphs show the results of executing with the
various values of A, that is, 20, 40, 60, 80, 100, 150, 200, 250, 300, 350, 400, 450, and 
500. Graph (a) of each figure presents, in order, the execution time of the applica- 
tions, the total time to create diffs, the total time waiting for diffs after sending a diff 
request, the total time to handle faults, and the total time to handle a lock acquire and 
barrier. Except for the execution time, all other values are the summation of the times 
of eight processors. Graph (b) presents the total number of message bytes delivered 
(msg_byte), the number of messages sent to remote processors in order to request a 
diff (remote_diff), the number of diffs created in the system (diff_creadted), the 
number of diffs created for updates (update_diff_creadted), the number of diffs 
delivered by updates (update_diffs_sent), and the ratio of diffs used by the system 
to diffs delivered by an update (hit ratio). 
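
The hit ratio defined above is a simple quotient. The sketch below is illustrative only; the counts are hypothetical and are not measurements from the paper.

```python
def hit_ratio(diffs_used, diffs_sent_by_update):
    """Fraction of the diffs delivered by updates that were actually used."""
    if diffs_sent_by_update == 0:
        return 0.0  # no updates were sent, so nothing could be hit
    return diffs_used / diffs_sent_by_update


example = hit_ratio(diffs_used=66, diffs_sent_by_update=100)  # hypothetical counts
```

A ratio well below 1 means that many diffs were pushed to a processor that never read them, which lengthens synchronization messages without saving any remote diff requests.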
TSP. With WAP, as the value of the window size, A, increases up to A=250, the
number of diffs updated (update_diffs_sent) increases. Therefore, WAP with A>=150
sends more diffs via updates than LH and sends fewer remote diff requests
(remote_diff) than LH. Consequently, the number of remote diff requests with WAP
when A=500 is less than half that of LH.

This result shows that since WAP causes the acquiring processor to send the latest
access pattern to the releasing processor, WAP is able to adapt successfully to the
access pattern of this application and thereby gains more chances to update than LH.
Although the overhead of WAP related to the working-set transmission
caused by an acquire increases the total message bytes (msg_byte),
this overhead does not result in a longer lock latency. Nevertheless, with
a large A value (A>300) the WAP performance is only marginally better than that of
LH, since TSP has a very high computation to communication ratio.

Barnes. It is interesting to note that when A=500, although WAP generates a similar 
number of remote diff requests as LH, it sends about 50% fewer diffs via updates and 
achieves a 10% higher hit-ratio than LH. This result also shows that WAP adapts
better to the access pattern than LH, as in TSP. However, in contrast
to TSP, the overhead of WAP with a large A value results in much longer lock la- 
tency, and thereby a worse overall performance. The reason for this is the frequent 
lock synchronizations related to the very small size (close to zero) of the critical sec- 
tion, which then produces the WAP overhead. Consequently, when A=20, WAP per- 
forms 8% better than LH and 5% better than LI. 

Ocean. For this application, although LH applies updates for the majority of the pages
and reduces the remote diff requests by 85% compared with LI, only 66% of the diffs
updated by LH are used. As a result, updating unnecessary diffs results in longer
synchronization latency. Consequently, the LH performance is much worse than that of
LI. In contrast, with all values of A, WAP can constrain the number of diffs updated.
Although WAP updates substantially fewer diffs than LH, the hit-ratio of the updated
diffs increases by about 30-36% and the synchronization latency is less than that of
LH. In conclusion, when A=20, WAP performed 28% better than LH and 9% better than LI.

Water-Spatial. With WAP, as the value of the window size, A, increases up to
A=250, the number of diffs updated also increases and the number of remote diff
requests decreases. When A=250, WAP is able to update 50% more diffs than LH, and
therefore generates 20% fewer remote diff requests. The fact that WAP can reduce
the number of remote diff requests compared to LH, even though LH achieves a 100%
hit-ratio, indicates the superior ability of WAP in tracking the access pattern of an
application. When A=20, WAP performs 15% better than LH and 10% better than LI.

Water-Nsquare. With WAP, when A>60, the number of diffs updated does not in- 
crease. It is interesting to note that, when A=60, the number of remote diff requests of 
WAP is nearly equal to that of LH, even though the number of diffs updated by WAP 
is only 77% of that of LH. The reason for this is that WAP sends fewer unnecessary 
diffs via updates than LH. When A=20, WAP performs 11% better than LH and 5% 
worse than LI. 
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6 Conclusion 

A new adaptive update/invalidate protocol based on a working-set model was
presented in this paper. Although a unique optimal window size could not be
selected for all applications, for most applications the proposed protocol (WAP)
outperformed both LH and LI with a very small window size (A=20 or 40). The reason
for this was that, even with such small window sizes, the proposed protocol was able
to reduce the number of remote diff requests caused by page faults relative to LI,
and to reduce the synchronization latency relative to LH by constraining the
number of diffs sent via updates. Furthermore, for some applications, since the
proposed protocol could track the access pattern better than LH, WAP was able
either to reduce the number of unnecessary diffs sent via updates or to create more
chances for updating.

Experimental results highlighted a weakness in the proposed protocol, namely the
overhead caused by sending the working-set along with a lock acquire message. Due
to this overhead, WAP with a larger working-set tends to lengthen the overall
execution time, even though many other elements of performance are improved.
In particular, some applications generated excessive lock acquires or barriers, and
the resulting overhead degraded their performance. However, in reality,
DSM cannot help but produce very bad speedups with all of the protocols when using
these applications. Therefore, it would appear that the applications need to be
restructured to suit DSM.
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Abstract. An optimal causal message ordering algorithm was recently 
proposed by Kshemkalyani and Singhal, and its optimality was proved 
theoretically. For a system of n processes, although the space complexity
of this algorithm was shown to be $O(n^2)$ integers, it was expected that
the actual space overhead would be much less than $n^2$. In this paper, we
determine the overhead of the optimal causal message ordering algorithm 
via simulation under a wide range of system conditions. The optimal 
algorithm is seen to display significantly less message overhead and log 
space overhead than the canonical Raynal-Schiper-Toueg algorithm. 



1 Introduction 

A distributed system consists of a number of processes communicating with each 
other by asynchronous message passing over reliable logical channels. There is 
no shared memory and no common clock in the system. A process execution is 
modeled as a set of events, the time of occurrence of each of which is distinct. 
A message can be multicast, in which case it is sent to multiple other processes. 
The ordering of events in a distributed system execution is given by the “happens
before” or causality relation jBJ, denoted by $\rightarrow$. For two events $e_1$ and
$e_2$, $e_1 \rightarrow e_2$ iff one of the following conditions is true: (i) $e_1$
and $e_2$ occur on the same process and $e_1$ occurs before $e_2$, (ii) $e_1$ is the
send of a message and $e_2$ is the delivery of that message, or (iii) there exists an
event $e_3$ such that $e_1 \rightarrow e_3$ and $e_3 \rightarrow e_2$.

Let Send(M) denote the event of a process handing over the message M to
the communication subsystem. Let Deliver(M) denote the event of M being
delivered to a process after it has been received by its local communication
subsystem. The system respects causal message ordering (CO) Q| iff for any pair of
messages $M_1$ and $M_2$ sent to the same destination,
$(Send(M_1) \rightarrow Send(M_2)) \Rightarrow (Deliver(M_1) \rightarrow Deliver(M_2))$.

Causal message ordering is valuable to the application programmer because it 
reduces the complexity of application logic and retains much of the concurrency 
of a FIFO communication system. Causal message ordering is useful in numerous 
areas such as managing replicated database updates, consistency enforcement in 
distributed shared memory, enforcing fair distributed mutual exclusion, efficient 
snapshot recording, and data delivery in real-time multimedia systems. Many 
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causal message ordering algorithms have been proposed in the literature. See 
[12 13 15 IBB] for an extensive survey of applications and algorithms. Causal message 
ordering has been implemented in many systems such as Isis [fJJ, Transis [1], 
Horus |0, Delta-4, Psync [0, and Amoeba [0. 

Any causal message ordering algorithm implementation has two forms of 
space overheads, viz., the size of control information on each message and the 
size of memory buffer space at each process. It is important to have efficient 
implementations of causal message ordering protocols due to their wide appli- 
cability. The causal message ordering algorithm given by Raynal, Schiper and 
Toueg 0, hereafter referred to as the RST algorithm, is a canonical solution to 
the causal message ordering problem. It has a fixed message overhead and memory
buffer space overhead of $n^2$ integers, where n (also denoted interchangeably
as N) is the number of processes in the system. The Horus [0, Transis [I I, and 
Amoeba 0 implementations of causal message ordering are essentially variants 
of the RST algorithm. 

Recently, Kshemkalyani and Singhal identified and formulated the neces- 
sary and sufficient conditions on the information required for causal message 
ordering, and provided an optimal algorithm to realize these conditions [0. This 
algorithm was proved to be optimal in space complexity under all network condi- 
tions and without making any simplifying system/communication assumptions. 
The authors also showed that the worst-case space complexity of the algorithm
is $O(n^2)$ integers but argued that in real executions, the actual complexity was
expected to be much less than $n^2$ integers, the overhead of the RST algorithm.

Although the Kshemkalyani-Singhal algorithm was proved to be optimal in 
space complexity by using a rigorous optimality proof, there are no experimen- 
tal or simulation results about the quantitative improvement it offers over the 
canonical RST algorithm. The purpose of this paper is to quantitatively deter- 
mine the performance improvement offered by the optimal Kshemkalyani-Singhal 
algorithm, hereafter referred to as the KS algorithm, over the RST algorithm. 
This is done by simulating the KS algorithm and comparing the amount of con- 
trol information sent per message and the amount of the memory buffer space 
requirements, with the fixed overheads of the RST algorithm. The results over a 
wide range of parameters indicate that the KS algorithm performs significantly 
better than the RST algorithm, and as the network scales up, the performance 
benefits are magnified. With N = 40, the KS algorithm has about 10% of the 
overhead of the RST algorithm. 

Note that the space overhead is the only metric of causal message ordering
algorithms studied in this simulation because it was shown in [5] that the time
(computational) overhead at each process for message send and delivery events
was similar for the KS algorithm and for the canonical RST algorithm, namely
$O(n^2)$.

Section 2 outlines the RST algorithm and the KS algorithm. Section 3 presents
the model of the message passing distributed system in which the KS algorithm
is simulated. Section 4 shows the simulation results of the KS algorithm in
comparison to the results expected from the RST algorithm. Section 5 concludes.
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2 Overview of the CO Algorithms 

This section briefly introduces the RST algorithm 0 and the optimal KS algo- 
rithm 0 for causal message ordering. Both the algorithms assume FIFO com- 
munication channels and that processes fail by stopping. 



2.1 The RST Algorithm 

Every process in a system of n processes maintains an $n \times n$ matrix, the SENT
matrix. $SENT[i,j]$ is the process's best knowledge of the number of messages
sent by process $P_i$ to process $P_j$. A process also maintains an array DELIV
of size n, where $DELIV[k]$ is the number of messages sent by process $P_k$ that
have already been delivered locally. Every message carries, piggybacked on it,
the SENT matrix of the sender process. A process $P_j$ that receives message M
with the matrix SP piggybacked on it is delivered M only if, $\forall i$,
$DELIV[i] \geq SP[i,j]$. $P_j$ then updates its local SENT matrix $SENT_j$ as:
$\forall k, \forall l \in \{1,\ldots,n\}$, $SENT_j[k,l] = \max(SENT_j[k,l], SP[k,l])$.
The space overhead on each message and in local storage at each process is the
size of the matrix SENT, which is $n^2$ integers.
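
The delivery test and the matrix merge described above can be sketched as follows.
This is a minimal illustration, not the authors' implementation: the function names
are hypothetical, processes are indexed from 0, and the counter increments that a
complete RST implementation performs on send and delivery are omitted because the
text above does not spell them out.

```python
def rst_can_deliver(deliv, sp, j):
    """Delivery test at process j for a message carrying the piggybacked matrix sp.

    deliv[i] is the number of messages from process i already delivered locally.
    The message is deliverable only if every message that the sender knows was
    sent to j (column j of sp) has already been delivered at j.
    """
    return all(deliv[i] >= sp[i][j] for i in range(len(deliv)))


def rst_merge(sent_local, sp):
    """Element-wise max of the piggybacked matrix into the local SENT matrix (in place)."""
    n = len(sent_local)
    for k in range(n):
        for l in range(n):
            sent_local[k][l] = max(sent_local[k][l], sp[k][l])
```

Both the matrix carried on a message and the matrix stored locally have n x n
entries, which is where the fixed overhead of the RST algorithm comes from.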



2.2 The KS Algorithm 

Kshemkalyani and Singhal identified the necessary and sufficient conditions on 
the information required for causal message ordering, and proposed an algorithm 
that implements these conditions. To outline the algorithm, we first introduce 
some formalisms. The set of all events E in the distributed execution
(computation) forms a partial order $(E, \rightarrow)$ which can also be viewed as a
computation graph: (i) there is a one-one mapping between the set of vertices in the
graph and the set of events E, and (ii) there is a directed edge between two vertices
iff either these vertices correspond to two consecutive events at a process or
correspond to a message send event and a delivery event, respectively, for the same
message. The causal past (resp., future) of an event e is the set
$\{e' \mid e' \rightarrow e\}$ (resp., $\{e' \mid e \rightarrow e'\}$). A path in the
computation graph is termed a causal path. $Deliver_d(M)$ denotes the event
Deliver(M) at process d.

The KS algorithm achieves optimality by storing in local message logs and
propagating on messages, information of the form “d is a destination of M”
about a message M sent in the causal past, as long as and only as long as

(Propagation Constraint I) it is not known that the message M is delivered to
d, and

(Propagation Constraint II) it is not guaranteed that the message M will be
delivered to d in CO.

In addition to the Propagation Constraints, the algorithm follows a Delivery 
Condition which states the following. A message M* that carries information “d 
is a destination of M”, where message M was sent to d in the causal past of 
Send(M*), is not delivered to d if M has not yet been delivered to d. 
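
The Delivery Condition alone can be sketched as a simple predicate. This is a
simplified illustration under stated assumptions, not the KS algorithm itself: a
piece of information “d is a destination of M” is modeled as a hypothetical tuple
(source, clock, dests) identifying M, and the destination keeps a set of
(source, clock) pairs for messages already delivered to it. The Propagation
Constraints, which decide what information is carried at all, are not modeled.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical representation of one piece of carried information:
# message M identified by (source, clock), plus its still-tracked destinations.
LogEntry = Tuple[int, int, FrozenSet[int]]


def ks_can_deliver(d: int,
                   carried: Iterable[LogEntry],
                   delivered: Set[Tuple[int, int]]) -> bool:
    """Return True if a message carrying `carried` may be delivered at process d.

    Delivery is blocked while any carried entry names d as a destination of a
    message that has not yet been delivered to d.
    """
    for source, clock, dests in carried:
        if d in dests and (source, clock) not in delivered:
            return False
    return True
```

Entries that do not name d as a destination never block delivery at d, so a
message carrying no information about d is delivered immediately.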
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Constraint (I) and the Delivery Condition contribute to optimality as follows: 
To ensure that M is delivered to d in CO, the information “d is a destination of 
M” is stored/propagated on and only on all causal paths starting from Send(M), 
but nowhere in the causal future of $Deliver_d(M)$.

Constraint (II) and the Delivery Condition contribute to optimality by the
following transitive reasoning: Let messages M, M' and M'' be sent to d, where
$Send(M) \rightarrow Send(M') \rightarrow Send(M'')$ and M' is the first message
sent to d on all causal chains between the events Send(M) and Send(M'). M will be
delivered optimally in CO to d with respect to (w.r.t.) M'' if (i) M is guaranteed
to be delivered optimally in CO to d w.r.t. M', and (ii) M' is guaranteed to be
delivered optimally in CO to d w.r.t. M''. Condition (i) holds if the information
“d is a destination of M” is stored/propagated on and only on all causal paths
from Send(M), but nowhere in the causal future of Send(M') other than on
message M' sent to d. This follows from the Delivery Condition. Condition (ii)
can be shown to hold by applying a transitive argument comprising conditions
(II)(i) and (I). To achieve optimality, the information “d is a destination of M”
must not be stored/propagated in the causal future of Send(M') other than on
message M' sent to d (follows from condition (II)(i)) or in the causal future of
$Deliver_d(M)$ (condition (I)).

Information about a message (I) not known to be delivered to d and (II) not 
guaranteed to be delivered to d in CO, is explicitly tracked by the algorithm using 
the triple (source, destination, scalar timestamp). This information is deleted as 
soon as either (I) or (II) becomes false. As the information “d is a destination
of M” propagates along various causal paths, the earliest event(s) at which
(I) becomes false, or (II) becomes false, are known as Propagation Constraint
Points PCP1 and PCP2, respectively, for that information. The information
never propagates beyond its Propagation Constraint Points. With this approach,
the space overhead on messages and in the local log at processes is less than the
$n^2$ overhead of the RST algorithm, and is proved to be always optimal.

The information “d is a destination of M” is also denoted as “$d \in M.Dests$”,
where M.Dests is the set of destinations of M for each of which (I) and (II)
are true. In an implementation, M.Dests can be represented in the local logs
at processes and piggybacked on messages using the data structures shown in
figure 1.



type LogStruct = record
    sender: process_id;
    clock: integer;
    numdests: integer;
    dests: array[1..numdests] of process_id;
end

type MsgOvhdStruct = record
    sender: process_id;
    clock: integer;
    numdests: integer;
    numLogEntries: integer;
    dests: array[1..numdests] of process_id;
    olog: array[1..numLogEntries] of LogStruct;
end

Fig. 1. The log data structure and message overhead data structure.



Evaluation of the Optimal Causal Message Ordering Algorithm 




The log is a variable length array of type LogStruct. Assuming that pro- 
cess_id is an integer, the size of a LogStruct structure is 3 + size(dests) integers, 
where size(X) is the number of elements in the set X. The log space overhead 
is the sum of the sizes of all the entries in the log. The amount of overhead 
on a message required by the KS algorithm is the size of the MsgOvhdStruct 
structure sent on it. The size of the MsgOvhdStruct structure can be determined 
as 4 + size(dests) + SIZE(olog), where SIZE(X) is the sum of the sizes of all
the entries in the set X of LogStructs. The message and log space overheads
are determined in this manner in our simulation system. 
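
As a worked illustration of this accounting, the sketch below mirrors the two
records and their size formulas. The class and method names are hypothetical, and
process identifiers are assumed to be plain integers, as in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogStruct:
    sender: int
    clock: int
    dests: List[int] = field(default_factory=list)

    def size(self) -> int:
        # sender, clock and numdests (3 integers) plus one integer per destination
        return 3 + len(self.dests)


@dataclass
class MsgOvhdStruct:
    sender: int
    clock: int
    dests: List[int] = field(default_factory=list)
    olog: List[LogStruct] = field(default_factory=list)

    def size(self) -> int:
        # sender, clock, numdests and numLogEntries (4 integers),
        # plus the destinations, plus SIZE(olog)
        return 4 + len(self.dests) + sum(entry.size() for entry in self.olog)


def log_space_overhead(log: List[LogStruct]) -> int:
    """Sum of the sizes of all entries in a process's log, in integers."""
    return sum(entry.size() for entry in log)
```

For example, a log holding one entry with two destinations and one entry with a
single destination occupies (3 + 2) + (3 + 1) = 9 integers; a message with three
destinations that piggybacks this log carries 4 + 3 + 9 = 16 integers.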

3 Simulation System Model 

A distributed system consists of asynchronous processes running on processors 
which are typically distributed over a wide area and are connected by a network. 
It can be assumed without any loss of generality that each processor runs a single 
process. Each process can access the communication network to communicate 
with any other process in the system using asynchronous message passing. The 
communication network is reliable and delivers messages in FIFO order between 
any pair of processes. 

3.1 Process Model 

A process is composed of two subsystems viz., the application subsystem and 
the communication subsystem. The application subsystem is responsible for the 
functionality of the process and the communication subsystem is responsible 
for providing it with causally ordered messaging service. The communication 
subsystem implements the causal message ordering algorithm in the simulation. 
The application subsystem generates message patterns that exercise the causal 
message ordering algorithm. The communication subsystem maintains a float- 
ing point clock, that is different from any clock in the causal message ordering 
algorithm. This clock is initialized to zero and tracks the elapsed run time of 
the process. Every process has a priority queue called the in-queue that holds 
incoming messages. This queue is always kept sorted in increasing order of the 
arrival times of messages in it. 

Message structure: A message is the fundamental entity that transfers
information from a sender process to one or more receiver processes. Each
message M has a causal-info field, a timestamp field, and a payload field. The
causal-info field is just a sequence of bytes on which a particular structure
is imposed by the causal message ordering algorithm. The RST algorithm
imposes an N x N matrix structure on the causal-info field. The KS algorithm
imposes the structure given in figure 1. The communication subsystem uses the
timestamp field to simulate the message transmission times. The in-queues 
are kept sorted by the timestamp field. The information that is contained in 
a message is referred to as its payload. In a real system, this would contain 
the application-specific packet of information according to the application-level 
protocol. 
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3.2 Simulation Parameters 

The system parameters that are likely to affect the performance of the KS algo- 
rithm are discussed next. 

— Number of processes (N): While most causal message ordering algo- 
rithms show good performance for a small number of processes, a good causal 
message ordering algorithm would continue to do so for a large number of 
processes. It is hence necessary to simulate any causal message ordering algo- 
rithm over a wide range of the number of processes. The number of processes 
in the system is limited only by the memory size and processor speed of the 
machine running the simulation. On an Intel Pentium III machine with 128 
MB of RAM and the simulation framework being implemented in Java, we 
could simulate up to 40 processes. 

— Mean inter-message time (MIMT): The mean inter-message time is the 
average period of time between two message send events at any process. It 
determines the frequency at which processes generate messages. The inter- 
message time is modeled as an exponential distribution about this parameter. 

— Multicast frequency (M/T): The behavior of the KS algorithm may be 
sensitive to the number of multicasts. The ratio of multicasts to the total 
number of message sends (M/T) is the parameter on the basis of which the 
multicast sensitivity of the KS algorithm can be determined. Processes like
distributed database updaters have M/T = 100%, whereas a collection of FTP
clients has M/T = 0. We simulate the KS algorithm with M/T varying
from 0 to 100%. The number of destinations of a multicast is best described 
by a uniform distribution ranging from 1 to N . 

— Mean transmission time (MTT): The transmission time of a message
here implicitly refers to (message size / bandwidth) + propagation delay. We
model this time as an exponential distribution about the mean, MTT. For
the purpose of enforcing this mean, multicasts are treated as multiple
unicasts and transmission time is independently determined for each unicast.
When a process needs to send a message, it determines the transmission
time according to the formula $Transmission\_time = -MTT \cdot \ln(R)$, where
R is a perfect (uniformly distributed) random number in the range [0,1]. This
formulation of the transmission time can violate FIFO order. As most causal
message ordering algorithms assume FIFO ordering, it is implemented explicitly
in our system. Every process maintains an array LM of size n to track the
arrival time of the last message sent to each other process. $LM[i]$ is the time
at which the last message from the current process to process $P_i$ will reach
$P_i$. Should the transmission time determined be such that the arrival time for
the next message at $P_i$ is less than $LM[i]$, then the arrival time is fixed at
$(LM[i] + 1)$ ms. $LM[i]$ is updated after every message send to $P_i$.

MTT is a measure of the speed of the network, with fast networks having
small MTTs. We have varied MTT from 50ms to 5000ms in these simulations
so as to model a wide range of networks.
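
The transmission-time sampling and the explicit FIFO enforcement described for MTT
can be sketched as follows. This is a minimal sketch under stated assumptions: the
function names are hypothetical, times are in milliseconds, and R is drawn from
(0, 1] rather than [0, 1] so that ln(R) is always defined.

```python
import math
import random


def transmission_time(mtt, rng=random):
    """Sample an exponentially distributed transmission time with mean mtt (ms)."""
    r = 1.0 - rng.random()  # rng.random() is in [0, 1), so r is in (0, 1]
    return -mtt * math.log(r)


def fifo_arrival_time(now, dest, lm, mtt, rng=random):
    """Arrival time of the next message to `dest`, never before the previous one.

    lm[dest] holds the arrival time of the last message sent to `dest`.
    If the sampled arrival would precede it, the arrival is fixed at
    lm[dest] + 1 ms, as in the text. lm[dest] is updated in place.
    """
    arrival = now + transmission_time(mtt, rng)
    if arrival < lm[dest]:
        arrival = lm[dest] + 1.0
    lm[dest] = arrival
    return arrival
```

A sender would call `fifo_arrival_time` once per unicast (and once per
destination of a multicast) and write the result into the timestamp field.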
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3.3 Process Execution 

All the processes in the system are symmetric and generate messages according
to the same MIMT and M/T. The processes in a distributed system execute
concurrently. However, simulating each process as an independent process/thread
involves inter-process/thread communication, and the delays involved are not easy
to control. Instead, a round-robin scheme was used to simulate the concurrent
processes. Each simulated process is given control for a time slot of 500ms. A
systemwide clock keeps track of the current time slot.

When a process is in control, it generates messages according to the MIMT.
The sender of a message determines the transmission time using MTT, adds it to
its current clock, and writes the result into the timestamp field of the message.
It then inserts this message into the in-queue of the destination process.

When a process gets control, it first invokes the communication subsystem. 
The communication subsystem looks at the head of its in-queue to determine if 
there are any messages whose timestamp is less than or equal to the current
value of the process clock. Such messages are the ones that must have already 
arrived and hence should have been processed before/during this time slot. All 
such messages are extracted from the queue and handed over to the causal 
message ordering delivery procedure in the order of their timestamps. The causal 
delivery procedure will buffer messages that arrived out of causal order. Note that 
this buffer is distinct from the in-queue. Messages in causal order are delivered 
immediately to the application subsystem. Blocked messages remain blocked till 
the messages that causally precede them have been delivered. The application 
subsystem then gets control and it generates messages according to the MIMT. 
The messages are handed over to the communication subsystem for delivery. 
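
The in-queue handling at the start of a time slot can be sketched with a binary
heap. The class below is a hypothetical illustration, not the authors' Java code:
it keeps messages ordered by timestamp and hands back, in timestamp order, every
message that has arrived by the current process clock. The causal delivery
procedure that consumes these messages is not modeled.

```python
import heapq
from itertools import count


class InQueue:
    """Priority queue of incoming messages, ordered by arrival timestamp."""

    def __init__(self):
        self._heap = []
        # The sequence number breaks ties, so messages with equal timestamps
        # come out in insertion order and payloads are never compared.
        self._seq = count()

    def insert(self, timestamp, message):
        heapq.heappush(self._heap, (timestamp, next(self._seq), message))

    def pop_arrived(self, clock):
        """Remove and return all (timestamp, message) pairs with timestamp <= clock."""
        arrived = []
        while self._heap and self._heap[0][0] <= clock:
            timestamp, _, message = heapq.heappop(self._heap)
            arrived.append((timestamp, message))
        return arrived
```

Messages with a later timestamp stay in the queue until a later time slot.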

A process $P_i$ stops generating messages once it has generated a sufficient
number of messages (see Section 4) and flags its status as completed. The
simulation stops when all the processes have their status flagged as completed.

4 Simulation Results 

The KS algorithm was simulated in the framework presented in Section 3. The
framework and the algorithm were implemented in Java using ObjectSpace JGL. 
The performance metrics used are the following. 

— The average number of integers sent per message under various combinations 

of the system parameters, viz., N, MTT , MIMT , and M/T. 

— The average size of the log in integers, under the same conditions. 

Simulation experiments were conducted for different combinations of the
parameters. For each combination, four runs were executed; the results of the four
runs did not differ from each other by more than a percent. Hence, only the
mean of the four runs is reported for each combination and the variance is not
reported.

For each simulation run, data was collected for 25,000 messages after the first
5000 system-wide messages to eliminate the effects of startup. Every process $P_i$
in the system accumulates in a variable $I_i$ the sum of the number of integers
that it sends out on outgoing messages. After every message send event and every
message delivery event, it determines the log size and accumulates it into a
variable $L_i$. It also tracks $m_i^s$, the number of messages sent, and $m_i^r$,
the number of messages delivered, during its lifetime. Once $P_i$ has sent out
$m_i^s = 30,000/N$ messages, it flags its status as complete and computes its mean
message overhead $MMV_i = I_i/m_i^s$ and its log space overhead
$LV_i = L_i/(m_i^s + m_i^r)$. These results are then sent to process $P_0$, which
computes the systemwide average message overhead $\sum_i MMV_i/N$ and the
systemwide average log space overhead $\sum_i LV_i/N$. All the overheads are
reported as a percentage of their corresponding deterministic overhead $n^2$ of
the RST algorithm.
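
The bookkeeping above reduces to a few divisions, sketched below. The function and
parameter names are hypothetical; the inputs are the per-process totals described
in the text (integers sent, accumulated log sizes, messages sent, messages
delivered), and the result is expressed as a percentage of the RST overhead of
n x n integers.

```python
def process_overheads(ints_sent, log_size_sum, msgs_sent, msgs_delivered):
    """Per-process mean message overhead and mean log space overhead, in integers."""
    mean_msg_overhead = ints_sent / msgs_sent
    # The log size is sampled after every send and every delivery event.
    mean_log_overhead = log_size_sum / (msgs_sent + msgs_delivered)
    return mean_msg_overhead, mean_log_overhead


def systemwide_percentages(per_process):
    """Average the per-process overheads and express them as a % of RST's n^2.

    per_process is a list with one (mean_msg_overhead, mean_log_overhead)
    pair per process, so its length is the number of processes n.
    """
    n = len(per_process)
    avg_msg = sum(msg for msg, _ in per_process) / n
    avg_log = sum(log for _, log in per_process) / n
    rst_overhead = n * n
    return 100.0 * avg_msg / rst_overhead, 100.0 * avg_log / rst_overhead
```

For instance, with 10 processes that each average 30 integers per message, the
message overhead is reported as 30% of the RST overhead of 100 integers.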

It is seen that the results for the log size overhead followed the same pattern 
as the results for the message size overhead in all the experiments. Hence, the 
log size overhead plots are not shown in this paper for space considerations. 



4.1 Scalability with Increasing N

RST scales poorly to networks with a large number of processes because of its
fixed overhead of $n^2$ integers. Although the KS algorithm has $O(n^2)$ overhead,
it is expected that the actual overhead will be much lower than $n^2$. We test the
scalability of the KS algorithm by simulation.



[Plot: "Message overhead for increasing N". x-axis: number of processes. Surviving legend entries (MTT, MIMT, M/T): (50ms, 100ms, 0.1), (100ms, 200ms, 0.3), (100ms, 200ms, 0.99).]

Fig. 2. Average message overhead as a function of N



The first three simulations were performed for (MTT, MIMT, M/T) fixed
at S1(50ms, 100ms, 0.1), S2(50ms, 400ms, 0.1) and S3(50ms, 1600ms, 0.1). The
number of processes was increased in steps of 5 starting from 5 up to 40. The
results for the average message overhead are shown in figure 2. Observe that
with increasing N, the message overhead rapidly decreases as a percentage of
RST. Note that in all these simulations, the overhead is always significantly less
than that of RST. For the case of 40 processes, for all the simulations, the
overhead is only 10% of that of RST. For a small number of processes, the overheads
reported by KS are 80% of those of RST, but the overhead of RST itself is
low for such systems. Similar results are seen for the next three simulations:
(MTT, MIMT, M/T) fixed at S4(400ms, 100ms, 0.1), S5(100ms, 200ms, 0.3),
and S6(100ms, 200ms, 0.99) (the other three curves in figure 2). The latter two
simulations show that the improvement in overhead is unaffected by increasing
the traffic, modeled by increasing the multicast frequency to 30% and 99%.

It can be seen from figure 2 that the performance (overhead relative to the
RST algorithm) gets better when the number of processes is increased keeping
MTT, MIMT, and M/T constant. This is because increasing the number of
processes implies an increase in the rate of generation of messages, given a
constant MIMT. As MTT is held constant, all these messages reach their
destinations in the same amount of time as with a lower number of processes. Hence,
there is greater dissemination of log information among the processes, which allows
the Propagation Constraints to work with more up-to-date information and purge
more information from the logs. Thus as n increases, the logs get purged more
quickly and their size tends to be an increasingly smaller fraction of $n^2$, the
size of logs in the RST algorithm.

From all the simulations S1 through S6 and the above analysis, it can be
concluded that the KS algorithm has a better network capacity utilization and
hence better scalability when compared to RST.



4.2 Impact of Increasing Transmission Time 

Increasing MTT is indicative of a decrease in available bandwidth and increasing
network congestion. The space overheads of the RST algorithm are fixed at $n^2$,
irrespective of network congestion conditions. We ran simulations for systems
consisting of 10, 15, and 20 processes under varying MIMT and M/T to analyze
the impact of increasing MTT. The results for the average message overhead are
shown in figure 3.

The first three simulations fixed (N, MIMT, M/T) at S1(15, 400ms, 0.1),
S2(15, 800ms, 0.1) and S3(15, 1600ms, 0.1), respectively. The MTT was increased
from 200ms to 4800ms progressively, in steps of 100 initially, 200 later, and
multiples of 2 finally. The fineness of the initial samples was necessary to see that
the overheads were growing fast initially but soon settled to a maximum. The
overhead of the algorithm as a % of the RST overhead first increases gradually
but soon reaches steady state despite further increases in MTT. This is explained
as follows. At low values of MTT, message transmission is very fast and hence
log sizes at the processes are small. However, as MTT grows even slightly, the
message transmission rate falls and the logs begin to grow. Hence
a growth in overheads can be seen in the initial parts of the curves. However,
once MTT becomes large, all the log sizes tend to a “steady-state” proportion of
$n^2$ (determined by other system parameters) that is significantly less than $n^2$.
This trend is because the pruning of the logs by the Propagation Constraints is
still effective. Also recall that the sizes of the logs are bounded [5]; once a
process $P_i$ has a log record of a message send to process $P_j$, a log record of a
new message send to $P_j$ can potentially erase all previous log records of messages
sent to $P_j$. At lower MTT, because of faster propagation of log information,
pruning logs using the Propagation Constraints is more effective and the logs are
much smaller.

[Plot: "Message overhead for increasing MTT".]

Fig. 3. Average message overhead as a function of MTT

Note that despite an initial increase, the overhead is always significantly less
than that of RST. For example, in simulations S1, S2, and S3, the message
overhead is never more than 40% of that of RST. The next three simulations fixed (N,
MIMT, M/T) at S4(20, 500ms, 0.3), S5(20, 500ms, 0.99), and S6(10, 400ms, 0.1).
For simulations S4 and S5, the overhead is always less than 24% of that of RST.

The runs S4 and S5 show that increasing multicast frequency, thus increasing
the network load, does not affect the overhead even under extreme network load
conditions, i.e., under high MTT. This is because the log sizes have already
reached a “steady-state” proportion of $n^2$ and multicasts cannot increase them
much further. Besides, multicasts effectively distribute the log information faster
into the system because they convey information to more processes.
Thus when a multicast message is ultimately delivered, it can potentially cause a
lot of log pruning at the destination. Thus we can conclude that the KS algorithm
has better performance when compared to RST, even under high MTT.



4.3 Behavior under Decreasing Communication Load 

The next set of simulations is aimed at determining the overhead behavior when
the KS algorithm is used in applications that use communication sparingly. The
values of (N, MTT, M/T) were fixed at S1(10, 100ms, 0.1), S2(15, 100ms, 0.1),
S3(15, 800ms, 0.1), and S4(20, 100ms, 0.1) while varying MIMT from 100ms to
12800ms, initially in steps of 100 and later in multiples of 2. The results for the
average message overhead are shown in figure 4. As we were testing the system
for behavior under light to moderate loads, we did not increase the traffic by
increasing M/T.



Fig. 4. Average message overhead as a function of MIMT (x-axis: MIMT in 
milliseconds; y-axis: overhead in % of RST) 



The results show a steep initial increase in overheads with increasing MIMT, 
followed by a leveling off to a steady overhead. Low MIMT means that messages 
are generated more frequently. As analyzed before, frequent message delivery 
disseminates log information faster and thus helps purge log entries. With increasing 
MIMT, the message delivery information required by the Propagation Constraints 
to prune logs takes longer to reach all the processes that 
have the log record of a message send event. Hence the pruning of logs slows 
and the logs grow in size with increasing MIMT. However, as MIMT becomes 
very high, new messages are generated very infrequently, so the growth of a 
process’s log is reduced. This causes 
the log growth rates to level off for high MIMTs. 

Note that despite the steep initial increase, the overheads are always much 
less than those of RST. This is true even in the case of the 10-process simulation 
where, though the overhead is higher than for all other runs, it is still always 
less than 45% of that of RST. 

4.4 Overhead for Increasing Multicast Frequency 

The sensitivity of the KS algorithm to multicast frequency is of interest because 
multicasts seem to favor the pruning of logs. 

We ran six simulation runs increasing M/T from 0.1 to 1.0 in steps of 0.1. The 
number of processes was varied starting from 25 and decreased to 12 across the 
simulations. MTT was progressively increased from 50ms to 500ms across the 
six runs. MIMT was varied from 400ms to 1000ms. The results for the average 
message overhead are shown in figure 5. 

Fig. 5. Average message overhead as a function of M/T, for M/T from 0.1 to 1. 
Plotted runs (N, MTT, MIMT): (25, 50ms, 400ms), (20, 50ms, 400ms), 
(15, 500ms, 400ms), and (12, 500ms, 500ms). 

For the two simulation runs (N, MTT, MIMT) = S1 (25, 50ms, 400ms) and 
S2 (20, 50ms, 400ms), the overheads are almost constant. For all the other runs, 
they decrease as M/T increases. The two simulation runs S1 and S2 represent 
networks of higher speed and more processes than the other runs. Because of the 
prompt delivery of all messages, increasing multicast frequency cannot decrease 
the overhead below the already minimal level. However, for the other 
simulations, which have high MTT and/or high MIMT, increasing multicasts 
distributes information more efficiently, which helps prune logs through 
effective application of the Propagation Constraints. 

This experiment reaffirms our guess about the performance under high loads. 
Despite increasing network traffic by increasing M/T, the overheads decrease. 

5 Concluding Remarks 

This paper conducted a performance analysis of the space complexity of the 
optimal KS algorithm under a wide range of system conditions using simulations. 
The KS algorithm was seen to perform much better than the canonical RST 
algorithm under the wide range of network conditions simulated. In particular, 
as the size of the system increased, the KS algorithm performed very well and 
had an overhead rate of less than 10% of that for the canonical RST algorithm. 
The algorithm also performed very well under stressful network loads, besides 
showing better scalability. As such, the KS algorithm, which has been shown 
theoretically to be optimal in space overhead, does offer large savings over 
the standard canonical RST algorithm, and is thus an attractive and efficient 
way to implement the causal message ordering abstraction. 

Abstract. We present a register efficient implementation of Mergesort 
which we call FAME (Finite Automaton MErgesort). FAME is an m-way 
Mergesort. The m streams are merged by organizing comparison tourna- 
ments among the elements at the heads of the streams. The winners of 
the tournament form the output stream. Many ideas are used to increase 
efficiency. First, the heads of the streams are maintained in the register 
file. Second, the tournaments are evaluated incrementally, i.e. after one 
winner is output the next tournament uses the results of the compar- 
isons performed in the preceding tournaments and thus minimizes work. 
Third, to minimize register movement, the state of the tournament is 
encoded as a finite automaton. We experimented with 8-way and 4-way 
FAME on an Ultrasparc and a DEC Alpha and found that these al- 
gorithms were better than cache-cognizant Quicksort algorithms on the 
same machines. 



1 Introduction 

It has been recently noted that explicit management of the memory hierarchy is 
necessary to get high performance for many application problems. In this paper 
we consider the question for sorting. We are specifically concerned with the 
highest level of the memory hierarchy: registers, though we also pay attention 
to the cache. 

Many sorting algorithms that attempt to adapt to the memory hierarchy have 
been proposed and studied in the literature. These studies have been concerned 
with the lower levels of the memory hierarchy, e.g. disk, or cache, or 
disk and cache. To the best of our knowledge, little has been reported on how 
to exploit registers efficiently. Since modern processors have a large number of 
registers (typically 32), and these tend to have higher bandwidth to the ALU or 
lower access time or both, we believe there is an opportunity here to improve 
performance further. 

Many of the ideas that have arisen while developing cache/disk cognizant 
sorting algorithms are also relevant while optimizing for register usage. One such 
idea is tiling: instead of making several passes over the entire dataset, it is better 
to break the dataset into tiles which can fit in the cache and then process each 
tile as intensively as possible. Typically, recursive implementations of Quicksort and 
Mergesort implicitly employ tiling, while their common iterative formulations do 
not. Another idea is to use multiway merging - for m-way merging the number 
of passes needed over data in Mergesort is log_m N for N keys. Clearly, large m 
is to be preferred. These techniques have been used by LaMarca and Ladner 
as well as Nyberg et al. 
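
To make the pass count concrete, here is a minimal Python sketch (not taken from the paper's code) that counts the passes of m-way merging. It assumes the initial runs have length 1 and that every pass merges groups of m runs; the function name merge_passes is hypothetical.

```python
def merge_passes(n_keys: int, m: int) -> int:
    """Count passes of m-way merging needed to sort n_keys keys,
    starting from runs of length 1 (an assumption of this sketch)."""
    if n_keys < 1 or m < 2:
        raise ValueError("need n_keys >= 1 and m >= 2")
    passes = 0
    run_length = 1
    while run_length < n_keys:
        run_length *= m   # one pass merges m runs into one longer run
        passes += 1
    return passes         # equals ceil(log_m(n_keys)), computed without floats


if __name__ == "__main__":
    for m in (2, 4, 8):
        print(m, merge_passes(4096, m))
```

For 4096 keys this gives 12 passes for m = 2, 6 for m = 4, and 4 for m = 8, which is the reason a large m is preferred.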

While ideas such as multiway merging are useful for efficiently using registers, 
new ideas are also needed. First, unlike memory, registers cannot be addressed 
indirectly, i.e. most instruction sets do not provide ways to say “compare register 
i and register j, where i and j are themselves in registers k and l”. Second, the 
number of registers is usually fairly small - this makes it possible to use special 
coding techniques. 

In this paper we describe a register efficient sorting algorithm called FAME 
(Finite Automaton MErgesort). FAME is an m-way Mergesort. The m streams 
are merged by organizing comparison tournaments among the elements at the 
heads of the streams. The winners of the tournament form the output stream. 
FAME exploits registers by maintaining the keys at the head of the streams in 
the register file. In FAME the tournaments are evaluated incrementally, i.e. after 
one winner is output the next tournament uses the results of the comparisons 
performed in the preceding tournaments and thus minimizes work. This needs 
the state of the tournament to be somehow recorded. FAME does this in a novel 
manner: the state of the tournament is encoded in a finite automaton. The basic 
merging mechanism of FAME is driven by this finite automaton, so effectively 
the state of the finite automaton (and the tournament) is encoded in the program 
counter. 

We describe two implementations of FAME and compare them with a 
memory-tuned Quicksort implemented along the lines described by LaMarca and 
Ladner. We chose Quicksort as the basis for comparison mainly because it 
turns out to be the best algorithm for the range (4K to 4M keys) studied by 
them. We present comparisons on two machines - a Tandem(DEC) Alpha-250 and 
an Ultrasparc. On both these machines, FAME implementations perform better 
than Quicksort. 

Our work is an extension of some of the work in . 

Outline: In Section 2 we present the Finite Automaton MErgesort. Section 3 
describes the main features of our benchmark Quicksort. In Section 4 we report 
our experiments. Section 5 discusses our conclusions and future work. 

2 FAME 

The basic merging iteration is as follows: 

1. Conduct a tournament between the keys at the head of all (non-decreasing) 
   streams to find the smallest key. 
2. Append the winner to the output stream. 
3. Advance the stream in which the winner was found. 

Obviously, in each iteration the entire tournament need not be played out 
explicitly; after all, nearly the same keys participate in the ith tournament as 
in the (i+1)th (the only difference being the key that won the ith tournament). 
More precisely, let T_i denote the transcript for the ith tournament (assuming it 
is played out completely). If m is the number of streams being merged, then a 
transcript is simply a sequence of m - 1 bits that record (in some fixed order) the 
results of the m - 1 comparisons performed in the tournament. The important 
observation is: Transcripts T_i and T_{i+1} differ in at most log m bits. To see this, 
suppose that the key at the head of the sth stream won the ith tournament. 
Then the keys involved in the (i+1)th tournament are the same as those in the 
ith tournament, except for the key from stream s. Thus, the transcript can only 
be different for the log m comparisons from the sth leaf to the root. 
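
The observation can be checked empirically. The following Python sketch is an illustration under stated assumptions, not the paper's implementation: it replays the full tournament before every output key and records how many transcript bits changed. The heap-style tree layout, the root-first bit order, the use of float("inf") as a sentinel, and the names transcript and max_transcript_change are all choices made for this sketch.

```python
import random

SENTINEL = float("inf")   # assumed larger than every real key


def transcript(heads):
    """Play a full tournament over the stream heads (len(heads) a power of 2).

    Returns (bits, winner): bits has one entry per internal node of a
    complete binary tree, root first, and is 1 when the left subtree's
    minimum is strictly smaller than the right subtree's; winner is the
    index of the stream holding the smallest head.
    """
    m = len(heads)
    # best[node] = index of the winning stream below node; leaves sit at
    # positions m .. 2m-1 and internal nodes at 1 .. m-1.
    best = [None] * (2 * m)
    for s in range(m):
        best[m + s] = s
    bits = [0] * m                      # bits[0] is unused
    for node in range(m - 1, 0, -1):
        left, right = best[2 * node], best[2 * node + 1]
        bits[node] = 1 if heads[left] < heads[right] else 0
        best[node] = left if bits[node] else right
    return bits[1:], best[1]


def max_transcript_change(streams):
    """Merge sorted streams, replaying the whole tournament for each key,
    and return the largest number of bits that differ between two
    consecutive transcripts."""
    padded = [list(s) + [SENTINEL] for s in streams]
    pos = [0] * len(streams)
    prev, worst = None, 0
    for _ in range(sum(len(s) for s in streams)):
        heads = [padded[s][pos[s]] for s in range(len(streams))]
        bits, winner = transcript(heads)
        if prev is not None:
            worst = max(worst, sum(a != b for a, b in zip(prev, bits)))
        prev = bits
        pos[winner] += 1
    return worst
```

For m = 4 the returned value should never exceed 2, and for m = 8 it should never exceed 3, since only the comparisons on the path from the advanced stream's leaf to the root can change.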

Thus, our basic action requires only log m comparisons to be performed. The 
important question, however, is how to keep track of the transcript conveniently 
so that at each point the comparisons to be performed are known easily. 

To do this, our code is organized as a finite automaton. In each state of 
the automaton one basic action takes place, and then the automaton transits 
to another appropriate state. States of the automaton correspond to possible 
transcripts of the tournament: after completing the tournament for finding the 
ith winner, the automaton enters the state T_i (as defined above). 

Action in each state of the automaton: In each state all we need to do is to 
incrementally compute the next transcript. From the description of the current 
state (transcript for the last tournament) we know which stream s contributed 
the previous winner, and hence the new key for the current tournament, and 
hence which log m comparisons need to be performed to complete the current 
tournament. The code for these comparisons is “hard-wired” into the code for each 
state. When a state is visited, these comparisons get performed. As a result the 
winner is known and appended to the output stream. The stream from which 
the winner came is advanced. Finally, we transit to the state corresponding to 
the newly constructed transcript. 

We give a detailed example. Consider the case of m = 4 streams, denoted 
s0, s1, s2, s3. Let H(s) denote the key at the head of stream s. The tournament 
tree is a complete binary tree with 4 leaves, and has 3 internal nodes, i.e. 3 
comparisons need be performed. The transcript for such a tournament thus has 
3 bits b0 b1 b2. The bits are interpreted as follows. b0 = 1 if H(s0) < H(s1), and 
0 otherwise. b1 = 1 if H(s2) < H(s3) and 0 otherwise. b2 = 1 if the smaller of 
H(s0), H(s1) is smaller than the smaller of H(s2), H(s3) and 0 otherwise. The 
automaton has 8 states. Figure 1 shows the code for state 101 as an example. 

The code is based on the following observations. First we know that for the 
previous tournament H(s0) < H(s1), H(s2) ≥ H(s3) and H(s0) < H(s3). Thus 
H(s0) was the smallest. Thus, H(s0) is new when the automaton reaches this 
state. Hence we need to compare the new H(s0) with H(s1), and the smaller 
among them with H(s3). Depending upon which way the comparisons go, this 
will cause either H(s0), H(s1) or H(s3) to be determined as the smallest. The 
smallest key thus determined is appended to the output stream; the stream from 
which it came is advanced, and finally a transition is made to the appropriate 
next state. 

State101: If H(s0) < H(s1) then 
              If H(s0) < H(s3) then 
                  Append H(s0) to output stream. Advance s0. Go to State101. 
              else 
                  Append H(s3) to output stream. Advance s3. Go to State100. 
          else 
              If H(s1) < H(s3) then 
                  Append H(s1) to output stream. Advance s1. Go to State001. 
              else 
                  Append H(s3) to output stream. Advance s3. Go to State000. 

Fig. 1. Code for State 101 
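
As a cross-check of the transitions in Fig. 1, the step for state 101 can be written as a small Python function and compared against a full replay of the three comparisons. This is only a sketch: state_101 and full_transcript are hypothetical names, states are represented as bit strings b0 b1 b2, and the output and stream-advance actions are left to the caller.

```python
def state_101(h0, h1, h3):
    """Hard-wired step for state 101, mirroring Fig. 1.

    h0 is the new head of s0; h1 and h3 are the heads of s1 and s3.
    H(s2) is not needed: state 101 already records H(s2) >= H(s3).
    Returns (index of the winning stream, next state as a bit string).
    """
    if h0 < h1:
        if h0 < h3:
            return 0, "101"
        return 3, "100"
    if h1 < h3:
        return 1, "001"
    return 3, "000"


def full_transcript(h):
    """Reference: replay all three comparisons for the heads h[0..3]."""
    b0 = 1 if h[0] < h[1] else 0
    b1 = 1 if h[2] < h[3] else 0
    b2 = 1 if min(h[0], h[1]) < min(h[2], h[3]) else 0
    return f"{b0}{b1}{b2}"
```

Whenever H(s2) ≥ H(s3) holds, the next state returned by state_101 should equal the transcript obtained by replaying the whole tournament, using two comparisons instead of three.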



Figure 1 omits some details, e.g. checking whether streams have ended. In the 
actual code, we use sentinels to terminate each stream. With this, it suffices to 
count the number of keys inserted into the output stream - when the requisite 
number are inserted, we simply append the sentinel and the merging is 
declared complete. More importantly, sentinels ensure that none of the streams 
terminates prematurely - at least the sentinel of the stream will stay till the end. 
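
The complete merging loop for m = 4, including the sentinels and the output count, might be sketched in Python as follows. This is an illustration only. Python has neither registers nor goto, so the transcript is kept in three variables instead of being encoded in the program counter, and the hard-wired per-state code of FAME is not reproduced. The name merge4 and the use of float("inf") as the sentinel are assumptions, and unlike the scheme described above the sketch does not append a sentinel to its output.

```python
SENTINEL = float("inf")   # assumed larger than every real key


def merge4(streams):
    """Merge four sorted lists with an incrementally updated tournament.

    The state is the transcript (b0, b1, b2) from the text. Each step
    redoes only the two comparisons on the path from the stream that
    supplied the previous winner to the root.
    """
    assert len(streams) == 4
    s = [list(x) + [SENTINEL] for x in streams]   # a sentinel ends each stream
    pos = [0, 0, 0, 0]

    def head(i):
        return s[i][pos[i]]

    # Initial tournament: all three comparisons are played once.
    b0 = head(0) < head(1)
    b1 = head(2) < head(3)
    b2 = head(0 if b0 else 1) < head(2 if b1 else 3)

    out = []
    remaining = sum(len(x) for x in streams)   # count keys, not stream ends
    while remaining:
        winner = (0 if b0 else 1) if b2 else (2 if b1 else 3)
        out.append(head(winner))
        pos[winner] += 1
        remaining -= 1
        # Incremental update: one leaf-level comparison, then the root.
        if winner < 2:
            b0 = head(0) < head(1)
        else:
            b1 = head(2) < head(3)
        b2 = head(0 if b0 else 1) < head(2 if b1 else 3)
    return out
```

Because a sentinel can win only when every stream is exhausted, counting the output keys is enough to decide when to stop, and no per-stream end check is needed inside the loop.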

Use of Registers: The motivation for the above mechanism is of course the 
use of registers. The keys at the heads of all streams and pointers to the stream 
heads are maintained in registers. An additional register is needed to count the 
number of keys that have been appended to the output stream. Thus 2m + 1 
registers are needed. 

Code length: Note that the code for any state must explicitly and separately 
deal with all the outcomes of log m comparisons. Thus there are m cases to 
consider, so the code will have length O(m). Since there are 2^(m-1) states whose 
code must be written out explicitly, the total code length is thus O(m 2^m). 
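
The register and code-length counts above can be tabulated for specific values of m. The Python helper below is only a worked illustration of those formulas; the name fame_costs and the dictionary keys are hypothetical.

```python
def fame_costs(m: int) -> dict:
    """Resource counts stated in the text for m-way FAME (m a power of 2)."""
    if m < 2 or m & (m - 1):
        raise ValueError("m must be a power of 2, at least 2")
    comparisons = m.bit_length() - 1            # log2(m) comparisons per state
    return {
        "registers": 2 * m + 1,                 # m heads + m pointers + 1 counter
        "states": 2 ** (m - 1),                 # one per (m-1)-bit transcript
        "cases_per_state": 2 ** comparisons,    # m possible outcomes
        "total_cases": m * 2 ** (m - 1),        # cases summed over all states
    }


if __name__ == "__main__":
    for m in (4, 8, 16):
        print(m, fame_costs(m))
```

For m = 4 this gives 9 registers and 8 states; for m = 8, 17 registers and 128 states; for m = 16, 33 registers and 32768 states.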



2.1 Comments 

We note that the merging mechanism described above has several good features. 
First, it actually allows a large amount of the state to be held in registers during 
processing. Second, it completely eliminates register movements and in general 
unnecessary copying. An alternative to our method would be to organize the 
tournament as a heap - we coded this approach but it was hopeless due to the 
index calculations and register moves involved in navigating through the heap. 

The main drawback of the scheme, in our opinion, is that the code for it 
is long. For example, the code for 8 way FAME exceeded the instruction cache 
length on one of the machines we experimented with (DEC 3000). If the in- 
struction cache is long enough, as was the case for the two machines we studied 
extensively, this is not a serious problem. 

It should be noted however, that while FAME attempts to minimize regis- 
ter moves (practically eliminates them), this may not be important on modern 
superscalar machines where the moves might be executed in parallel anyway. 

2.2 Implementation Details 

We did two implementations of FAME, one for m = 4 and the other for m = 8. 
The choice of m is governed by several factors. First, to keep the tournament 
tree simple, we would like m to be a power of 2. Second, to implement the merge 
mechanism well we need the processor to have at least 2m + 1 registers. Since 
most modern processors have about 32 registers, this limits m to 8. Finally, the 
code length is proportional to m 2^(m-1). For m = 16 this would be too large, 
unlikely to fit in the instruction caches of most processors. 

Our implementation was in the C language, and the code for the merging 
mechanism described above was generated by using C preprocessor macros. This 
is inevitable, especially for the case m = 8 which has 128 states. The code for 
each state is also quite long, since it must handle each of the 8 possible 
ways in which 3 comparisons can get resolved. To ensure that the heads of the 
streams and the pointers to heads are kept in registers, we declared the associated 
variables to be of storage class register.¹ 

The use of the sentinels can be expensive (excessive memory requirement) 
for merging very short sequences. For example, if we were to merge length 1 
sequences, then half the memory would be taken up by sentinels. We avoided 
this by using Quicksort to sort very short sequences. The threshold for this 
was determined experimentally: it was about 16 for the DEC Alpha, 128 for the 
Ultrasparc and 512 for the RS 6000. 

Our merging code was organized recursively. This organization has the over- 
head of recursion, but it ends up exploiting locality and gives better cache per- 
formance. 

3 Quicksort 

Our Quicksort is based on the memory tuned Quicksort of LaMarca and Ladner 
and includes several standard optimizations: (i) For sequences of length 32 we 
performed insertion sort, (ii) Instead of selecting the splitter at random we chose 
it to be the median of 3 randomly chosen elements, (iii) The algorithm was coded 
in a recursive manner which is known to exploit locality better and thus give 
better cache behaviour. The implementation was in the C language and was 
compiled with highest level of optimization. 

4 Experiments 

We experimented extensively on two machines: a DEC Alpha, and an Ultrasparc. 
The complete configurations are as follows: 

             Registers     L1 D-Cache   L1 I-Cache   L2 Cache   Clock Speed 
Alpha-250    32            16 KB        16 KB        2 MB       266 MHz 
Ultrasparc   24 (window)   16 KB        16 KB        512 KB     167 MHz 

1 We examined the generated code and found that the compiler had indeed allocated 
registers as we had directed. 

We report the time to sort using various algorithms and various numbers of 
keys. In all cases the keys are 32 bit integers. The sorting time is reported on 
a per key basis, so that to calculate the actual sorting time it is necessary to 
multiply the plotted time by the number of keys. Three algorithms are considered: 
(i) Quicksort, (ii) 4-way FAME, (iii) 8-way FAME. The reported time is the 
median of the times taken for sorting 100 randomly generated data sets. For 
consistency, the same data sets were used for all algorithms. We have not shown 
the standard deviation, but it was less than 1 % for FAME for more than 4096 
keys, and less than 3 % for Quicksort. The mean was very close to the median 
(easily within the standard deviation). 
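
The measurement protocol described above could be sketched as follows. The paper does not describe its timing harness, so everything here is an assumption made for illustration: the function name per_key_median_time, the use of time.perf_counter, and a seeded random.Random generator to give every algorithm the same data sets.

```python
import random
import statistics
import time


def per_key_median_time(sort_fn, n_keys, n_sets=100, seed=0):
    """Median over n_sets random inputs of (sorting time / n_keys), in seconds."""
    rng = random.Random(seed)   # fixed seed: every algorithm sees the same data
    per_key = []
    for _ in range(n_sets):
        keys = [rng.getrandbits(32) for _ in range(n_keys)]   # 32 bit integer keys
        start = time.perf_counter()
        sort_fn(keys)
        per_key.append((time.perf_counter() - start) / n_keys)
    return statistics.median(per_key)


if __name__ == "__main__":
    # Python's built-in sort stands in for the algorithms under test.
    print(per_key_median_time(sorted, 4096, n_sets=10))
```

Multiplying the returned value by the number of keys recovers the median total sorting time.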

We also ran experiments on a DEC 3000 and an Intel Pentium. On the Pen- 
tium FAME performed badly. This is to be expected because the Pentium does 
not have many registers. The behaviour on the 3000 is discussed in Section 4.3. 

4.1 Alpha-250 Results 



GRAPH 1: 32 bit keys on Alpha-250 




It is seen that the 8-way FAME outperforms Quicksort for more than 8192 
keys. Beyond 65536 keys, even the 4-way FAME performs better, except for 
524288 keys. 

The plots for FAME show several minor bumps and one major bump at 
524288 keys. At 524288 keys the data no longer fits in the 2MB secondary cache 
(remember that Mergesort requires about twice the memory of Quicksort), 
which we believe is why the time rises steeply. It should be noted 
that 8-way FAME performs better than Quicksort in spite of this rise. Quicksort 
also shows this steep rise, but it happens at 1 M keys. At 1 M keys and more, 
Quicksort performance is worse than 8-way as well as 4-way FAME. 

The minor bumps in the plots for FAME probably arise because of the way 
in which FAME switches to Quicksort. Nominally the switch happens when the 
number of keys drops below 32. But because the merging is 4-way or 8-way, 
the switch could happen at either 8 keys, or 16 keys or 32 keys. 



4.2 Ultrasparc Results 



GRAPH 2: 32 bit keys on UltraSparc 




It is seen that 4-way FAME outperforms Quicksort for more than 16K keys. 
8-way FAME outperforms Quicksort beyond 512K keys, but is itself worse than 
4-way FAME. The plots here are much smoother than those for the Alpha-250, 
presumably because the processor is slower and thus better matches the memory 
speeds. 



4.3 DEC 3000 

The interesting observation here was that 4-way FAME was better than Quick- 
sort, but 8 way FAME was not. Our DEC 3000 (MIPS processor) machine had 
a small instruction cache, which was easily seen to be incapable of holding the 
entire code for the 8 way FAME. Thus, we believe that the poor performance 
of 8 way FAME was due to instruction cache thrashing. It will be interesting to 
run the experiments on a MIPS architecture with a larger cache. 

5 Concluding Remarks 

We expected 8-way FAME to outperform both the 4-way FAME and the 
Quicksort. The number of comparisons made by all the algorithms is roughly similar; 
the difference is in the number of passes over memory. For N keys, Quicksort 
makes log_2 N passes over memory, 4-way FAME makes log_4 N passes and 8-way 
FAME log_8 N passes. 

Our expectation held up for the Alpha-250, but not for the Ultrasparc. A pos- 
sible explanation is that the code simplicity of 4 way FAME compensates for its 
additional passes over memory. But this needs more analysis, certainly. It should 
be noted that modern processors are very complex - with lots of mechanisms 
such as superscalar issue, pipeline interlocks, branch prediction mechanisms etc. 
coming into play. The final performance is a result of these complex interactions. 

In summary, we have presented a strategy for exploiting registers in sorting 
programs. We have also presented a preliminary evaluation of the strategy on two 
architectures, where the strategy was seen to yield performance improvements. 
It will be interesting to evaluate the strategy on other processor architectures. 

We suspect that much more work can be done in this area. Following LaMarca 
and Ladner, it might be interesting to seek register efficient implementations for 
all the standard sorting algorithms such as Selection sort, Heapsort, Radix sort 
and Quicksort, not just Mergesort. While we suspect that the only serious chal- 
lenger to FAME will be a register efficient Quicksort (say a multiway Quicksort), 
studying the register efficient variants of even selection sort will be of interest, 
since that gets used when the number of keys drops down below 32. For this size, 
selection sort can be potentially extremely efficient because it could be made to 
run entirely inside the registers! 

Abstract. An increasing number of mission-critical systems are being developed 
using distributed object computing middleware, such as CORBA. Applications 
for these systems often require the underlying middleware, operating systems, 
and networks to provide end-to-end quality of service (QoS) support to enhance 
their efficiency, predictability, scalability, and fault tolerance. The Object Manage- 
ment Group (OMG), which standardizes CORBA, has addressed many of these 
QoS requirements in the recent Real-time CORBA and Fault Tolerant CORBA 
(FT-CORBA) specifications. This paper describes the patterns we are incorporating 
into a FT-CORBA service called DOORS to eliminate performance bottlenecks 
caused by common implementation pitfalls. 



1 Introduction 

Emerging trends: Applications for next-generation distributed systems are 
increasingly being developed using standard services and protocols defined by 
distributed object computing middleware, such as the Common Object Request Broker 
Architecture (CORBA). CORBA is a distributed object computing middleware standard defined 
by the OMG that allows clients to invoke operations on remote objects without concern 
for where the object resides or what language the object is written in. In addition, 
CORBA shields applications from non-portable details related to the OS/hardware 
platform they run on and the communication protocols and networks used to interconnect 
distributed objects. These features make CORBA ideally suited to provide the core 
communication infrastructure for distributed applications. 

A growing number of next-generation applications demand varying degrees and 
forms of quality of service (QoS) support from their middleware, including efficiency, 
predictability, scalability, and fault tolerance. In CORBA-based middleware, this QoS 
support is provided by Object Request Broker (ORB) endsystems. ORB endsystems 
consist of network interfaces, operating system I/O subsystems, CORBA ORBs, and 
higher-level CORBA services. 

Addressing middleware research challenges with patterns: Our prior research on 
CORBA middleware (www.cs.wustl.edu/~schmidt/corba-research.html) 
has explored many efficiency, predictability, and scalability aspects of ORB endsystem 
design, including static and dynamic scheduling, event processing, I/O subsystem and 
pluggable protocol integration, synchronous and asynchronous ORB Core architectures, 
systematic benchmarking of multiple ORBs, optimization principle patterns for ORB 
performance, and measuring performance of a CORBA fault-tolerant service. This paper 
focuses on another dimension in the ORB endsystem design space: applying patterns to 
improve the performance of Fault Tolerant CORBA (FT-CORBA) implementations. 

A pattern names and describes a recurring solution to a software development 
problem within a particular context. Patterns help to alleviate the continual re-discovery 
and re-invention of software concepts and components by documenting and teaching 
proven solutions to standard software development problems. For instance, patterns are 
useful for documenting the structure and participants in common communication 
software micro-architectures, such as active objects and brokers. These patterns are 
generalizations of object-structures that have been used successfully to build flexible, 
efficient, event-driven, and concurrent communication software, including ORB 
middleware. 

In general, patterns can be categorized as follows: 

Design patterns: A design pattern captures the static and dynamic roles and rela- 
tionships in solutions that occur repeatedly when developing software applications in 
a particular domain. The design patterns we apply to improve the performance of FT- 
CORBA include: Abstract Factory, Active Object, Chain of Responsibility, Component 
Configurator, and Strategy. 

Architectural patterns: An architectural pattern expresses a fundamental structural 
organization schema for software systems that provides a set of predefined subsystems, 
specifies their responsibilities, and includes rules and guidelines for organizing the rela- 
tionships between them. The architectural patterns we apply to improve the performance 
of FT-CORBA include: Leader/Followers and Reactor. 

Optimization principle patterns: An optimization principle pattern documents 
rules for avoiding common design and implementation mistakes that degrade the 
performance, scalability, predictability, and reliability of complex systems. The optimization 
principle patterns we applied to improve performance of FT-CORBA include: 
optimizing for the common case, eliminating gratuitous waste, and storing redundant state to 
speed up expensive operations. 

Paper organization: The remainder of this paper is organized as follows: Section 2 
summarizes the recently adopted Fault Tolerant CORBA (FT-CORBA) specification; 
Section 3 describes the patterns we are using to improve the performance of our 
FT-CORBA service called DOORS; and Section 4 presents concluding remarks. 

2 Overview of the Fault Tolerant CORBA Specification 

The Fault Tolerant CORBA (FT-CORBA) specification defines a standard set of 
interfaces, policies, and services that provide robust support for applications requiring 
high reliability. The fault tolerance mechanism used to detect and recover from fail- 
ures is based on entity redundancy. Naturally, in FT-CORBA the redundant entities are 
replicated CORBA objects. 

Replicas of a CORBA object are created and managed as a “logical singleton” 
composite object. Figure 1 illustrates the key components in the FT-CORBA architecture. 
All components shown in the figure are implemented as standard CORBA objects, i.e., 
they are defined using CORBA IDL interfaces and implemented using servants that 
can be written in standard programming languages, such as Java, C++, C, or Ada. The 
functionality of each component is described below. 




Fig. 1. The Architecture of Fault Tolerant CORBA 

Interoperable object group references (IOGRs): FT-CORBA standardizes the format 
of interoperable object references (IOR) used for the individual replicas. An IOR is a 
flexible addressing mechanism that identifies a CORBA object uniquely. In addition, 
FT-CORBA defines an IOR for composite objects called the interoperable object group 
reference (IOGR). 

FT-CORBA servers can publish IOGRs to clients. Clients use these IOGRs to invoke 
operations on servers. The client-side ORB transmits the request to the appropriate 
server-side object that handles the request. The client application need not be aware 
of the existence of server object replicas. If a server object fails, the client-side ORB 
cycles through the object references contained in the IOGR until the request is handled 
successfully by a replica object. The references in the IOGR are considered invalid only 
if all server objects fail, in which case an exception is propagated to the client application. 
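
The client-side failover behaviour described above can be illustrated with a small Python model. This is not CORBA code and uses no ORB API: an IOGR is modelled as a plain list of callables, a failed replica is modelled by a raised ConnectionError, and the names invoke_with_failover and AllReplicasFailed are hypothetical.

```python
class AllReplicasFailed(Exception):
    """Raised when no reference in the group could handle the request."""


def invoke_with_failover(iogr, request):
    """Try each replica reference in turn until one handles the request.

    iogr is modelled as a list of callables, one per replica; a replica
    signals failure by raising ConnectionError. Both choices are
    assumptions of this sketch, not part of FT-CORBA.
    """
    for replica in iogr:
        try:
            return replica(request)
        except ConnectionError:
            continue   # this replica is down; move on to the next reference
    raise AllReplicasFailed(f"no replica handled {request!r}")
```

The caller sees either a normal reply or a single exception once every reference has been tried, which is how the existence of replicas stays hidden from the client application.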

ReplicationManager: This component is responsible for managing replicas and con- 
tains the following three components: 

1. PropertyManager: This component allows properties of an object group to be se- 
lected. Common properties include the replication style, membership style, consistency 
style, and initial/minimum number of replicas. Example replication styles include the 
following: 

- COLD_PASSIVE - In this replication style, the replica group contains a single primary 
  replica that responds to client messages. If a primary fails, a backup replica is 
  spawned on demand to function as the new primary. 

- WARM_PASSIVE - In the WARM_PASSIVE replication style, the replica group 
  contains a single primary replica that responds to client messages. In addition, one or 
  more backup replicas are pre-spawned to handle crash failures. If a primary fails, 
  a backup replica is selected to function as the new primary and a new backup is 
  created to keep the replica group size constant. 

- ACTIVE - In the ACTIVE replication style, all replicas are primary and handle client 
  requests independently of each other. To ensure that a single reply is sent to the 
  client and to maintain consistent state amongst the replicas, a special group 
  communication protocol is necessary. 
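
As an illustration of the WARM_PASSIVE style, the following toy Python model promotes a pre-spawned backup when the primary fails and spawns a replacement so the group size stays constant. It is a sketch only: the class WarmPassiveGroup, its method names, and the spawn callable are hypothetical and do not correspond to any FT-CORBA interface.

```python
class WarmPassiveGroup:
    """Toy model of a WARM_PASSIVE replica group (names are illustrative)."""

    def __init__(self, spawn, size):
        self.spawn = spawn                               # callable creating a new replica
        self.replicas = [spawn() for _ in range(size)]   # index 0 is the primary

    @property
    def primary(self):
        return self.replicas[0]

    def on_primary_failure(self):
        """Promote the first backup and spawn a fresh backup,
        keeping the group size constant."""
        self.replicas.pop(0)                  # drop the failed primary
        self.replicas.append(self.spawn())    # replacement backup
        return self.primary
```

A COLD_PASSIVE group would differ only in that no backups exist before the failure, so the new primary would have to be spawned at failover time.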

Membership of a group and data consistency of the group members can be controlled 
either by the FT-CORBA infrastructure or by applications. FT-CORBA standardizes both 
application-controlled and infrastructure-controlled membership and consistency styles.

2. GenericFactory: For the infrastructure-controlled membership style, the Gener- 
icFactory is used by the ReplicationManager to create object groups and 
individual members of an object group. 

3. ObjectGroupManager: For the application-controlled membership style, applications
use the ObjectGroupManager interface to create, add, or delete members of
an object group.

Fault Detector and Notifier: FaultDetectors are CORBA objects responsible
for detecting faults via either a pull-based or a push-based mechanism. A pull-based 
monitoring mechanism periodically polls applications to determine if their objects are 
“alive.” FT-CORBA requires application objects to implement a PullMonitorable 
interface that exports an is_alive operation. A push-based monitoring mechanism 
can also be implemented. In this scheme, which is also known as a “heartbeat monitor,” 
applications implement a PushMonitorable interface and send periodic heartbeats 
to the FaultDetector. 
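
The two monitoring styles can be contrasted with a small sketch. The class below only mimics the is_alive operation named in the text; poll_once and stale_heartbeats are hypothetical helpers, and timestamps are plain numbers in seconds.

```python
class PullMonitorable:
    """Hypothetical stand-in for an application object that can be polled."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def is_alive(self):
        return self.alive


def poll_once(monitored):
    """Pull style: return the names of objects that failed the liveness poll."""
    return [obj.name for obj in monitored if not obj.is_alive()]


def stale_heartbeats(last_seen, now, timeout):
    """Push style: names whose last heartbeat is older than `timeout`."""
    return sorted(name for name, t in last_seen.items() if now - t > timeout)
```

In the pull style the detector initiates every check, whereas in the push style it only inspects the times at which heartbeats last arrived.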

FaultDetectors report faults to FaultNotifiers. In turn, the FaultNotifiers
propagate these notifications to a ReplicationManager, which performs
recovery actions.

Logging and Recovery: FT-CORBA defines a logging and recovery mechanism that 
is responsible for intercepting and logging CORBA GIOP messages from client objects 
to servers. Distributed applications can employ this mechanism via an infrastructure- 
controlled consistency style. If a failure occurs, a new replica is chosen to become the 
“primary.” The recovery mechanism then re-invokes the operations that were made by 
the client, but which did not execute due to the primary replica’s failure. In addition, 
it retrieves a consistent state for the new replica. The logging and recovery mechanism 
ensures that failovers are transparent to applications. For the application-controlled con- 
sistency style, applications are responsible for their own failure recovery. 
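
The log-and-replay idea can be illustrated with a toy model. Everything here is an assumption made for illustration: the replica state is a running total, requests are integers to add, and RecoveryLog is a hypothetical name rather than an FT-CORBA interface.

```python
class Replica:
    """Hypothetical replica whose state is a running total."""

    def __init__(self, state=0):
        self.state = state

    def apply(self, amount):
        self.state += amount
        return self.state


class RecoveryLog:
    """Records each request until the primary confirms that it executed."""

    def __init__(self):
        self.pending = []
        self.checkpoint = 0

    def record(self, amount):
        self.pending.append(amount)

    def confirm(self, state):
        # The oldest pending request has executed; remember the resulting state.
        self.pending.pop(0)
        self.checkpoint = state

    def recover(self):
        """Build a new primary from the checkpoint and replay unexecuted requests."""
        new_primary = Replica(self.checkpoint)
        for amount in self.pending:
            new_primary.apply(amount)
        self.pending.clear()
        return new_primary
```

A request that was logged but never confirmed is re-invoked on the new primary, so the client sees the same result it would have seen without the failure.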

FT-CORBA is designed to prevent single points of failure within a distributed object 
computing system. As a result, each component described above must itself be replicated. 
Moreover, mechanisms must be provided to deal with potential failures and recovery. 



3 Applying Patterns to Improve DOORS Fault Tolerant CORBA 
Performance 

Implementations of FT-CORBA, such as DOORS, are representative of complex com- 
munication software. Optimizing this type of software is hard since seemingly minor 
“mistakes,” such as poor choice of concurrency architectures and data structures, lack 
of caching, and the inability to configure parameters dynamically, can adversely affect 
performance and availability. Therefore, developing high-performance, predictable, re- 
liable, and robust software requires an iterative optimization process that involves (1) 
performance benchmarking to identify sources of overhead and (2) applying patterns to 
eliminate the identified sources of overhead. The patterns described in this section are 
shown in Table 1.

Prior work describes a family of optimization principle patterns and illustrates how they
have been applied in existing protocol implementations, such as TCP/IP and CORBA IIOP,
to improve their performance. Likewise, our prior research on developing extensible
real-time middleware has enabled us to document the design, architectural, and
optimization principle patterns used to improve performance and predictability.

This section focuses on the design, architectural, and optimization principle patterns
we are applying to systematically improve the performance of the DOORS FT-CORBA
implementation. We focus on these patterns since they were the most strategic to
eliminate sources of overhead in DOORS FT-CORBA that arose from common
implementation pitfalls.

Table 1. Patterns for Implementing FT-CORBA Efficiently

#  Problem                               Pattern                       Pattern Category
1  Missed polls in FaultDetector         Leader/Followers              Architectural
2  Excessive overhead of recovery        Active Object                 Design
3  Excessive overhead of service lookup  Optimize for the common case  Optimization
                                         Eliminate gratuitous waste    Optimization
                                         Store extra information       Optimization
4  Tight coupling of data structures     Strategy                      Design
                                         Abstract Factory              Design
5  Inability for dynamic configuration   Component Configurator        Design
6  Property lookup                       Chain of Responsibility       Design
                                         Perfect hash functions        Optimization

In the following discussion, we outline the forces underlying the key design chal- 
lenges that arise when developing high-performance FT-CORBA middleware, such as 
DOORS. We also describe which patterns resolve these forces and explain how these 
patterns are used in DOORS. In general, the absence of these patterns leaves these forces 
unresolved. 



3.1 Decoupling Polling and Recovery 

Context: In DOORS, FT-CORBA objects operating under the PULL-based fault mon- 
itoring style are polled at specific intervals of time by a separate poller thread in the 
FaultDetector. In the event of failure, this poller thread identifies an application 
object crash and reports the failure to the ReplicationManager, which then per- 
forms the recovery. 

Problem: If the polling thread polls and reports failures in the same thread of control, 
then in the event of failure it will block until the ReplicationManager has recovered
from failure. This will cause missed polls to other application objects. This behavior is 
unacceptable for systems requiring high availability. A naive solution would be to create a 
separate polling thread for each application object. This strategy does not scale, however, 
as the number of objects polled by the fault monitor increases. Thus, the force that must
be resolved involves ensuring that the FaultDetector polls all application objects 
at the specified intervals, even when the poller thread is blocked during failure recovery. 

Solution → the Leader/Followers pattern: An effective way to avoid unnecessary
blocking is to use the Leader/Followers pattern. This pattern provides an efficient
concurrency model where multiple threads take turns sharing a set of event sources in 
order to detect, demultiplex, dispatch, and process service requests that occur on these 
event sources. 

Fig. 2. Applying the Leader/Followers Pattern in DOORS



Figure 2 illustrates how this pattern is implemented in DOORS’s FaultDetector.
A pool of threads is allocated a priori to poll a set of application objects. One thread 
is elected as the leader to monitor the application objects. When a failure is detected, 
one of the follower threads is promoted to become the new leader, which then polls 
the remaining application objects. In contrast, the previous leader thread informs the 
ReplicationManager of the failure and blocks until recovery completes, at which 
point the previous leader thread becomes a follower. This pattern resolves the force of 
polling all application objects, even when the poller is blocked on the recovery of a failed 
object. 
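
A minimal sketch of this hand-off follows, assuming a fixed thread pool and a single polling round. The function run_detector and its parameters are hypothetical and are not DOORS code; a lock stands in for leadership, so whichever thread holds it is the leader.

```python
import threading


def run_detector(objects, is_alive, recover, pool_size=3):
    """Poll every object once; a slow recovery never stalls the polling."""
    leader = threading.Lock()      # whoever holds this lock is the leader
    remaining = list(objects)      # objects still to be polled this round
    polled = []

    def worker():
        while True:
            failed = None
            with leader:           # wait as a follower until promoted
                while remaining and failed is None:
                    obj = remaining.pop(0)
                    polled.append(obj)
                    if not is_alive(obj):
                        failed = obj
            # Leadership is released here, so a follower resumes polling
            # while this thread reports the failure and waits for recovery.
            if failed is None:
                return             # nothing left to poll
            recover(failed)

    threads = [threading.Thread(target=worker) for _ in range(pool_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return polled
```

With a pool of one thread the sketch degenerates to the blocking design described in the problem statement, which is one way to see what the additional followers buy.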



3.2 Decoupling Recovery Invocation and Execution 

Context: When the PULL-based monitoring style is used in the DOORS implementation, 
the poller thread of the FaultDetector is responsible for polling application objects 
at constant time intervals. Whenever an application object fails to respond to the poll 
message, the FaultDetector must report the failure to the ReplicationManager.
The ReplicationManager, in turn, can receive such reports from more
than one FaultDetector. The DOORS ReplicationManager serializes the
failure report requests by handling them sequentially. For performance-sensitive appli- 
cations with high availability requirements, it is imperative that the Replication- 
Manager be notified of failures and that recovery occur within a bounded amount of 
time. 

Problem: The blocking behavior of the FaultDetector’s poller thread and missed 
polls discussed earlier precludes the propagation of failure reports from other application 
objects monitored by the same FaultDetector to the ReplicationManager. 
Also, since the ReplicationManager serializes all the failure reports, this degrades 
its responsiveness. In production systems, a ReplicationManager may receive 
many failure reports. Handling the failure reports sequentially incurs significant delay
in the recovery process for queued requests. 

A naive solution based on creating a thread per failure report scales poorly in
a dynamic environment where failure reports may arrive in bursts. In addition, thread
creation is expensive and inefficient programming may yield excessive synchronization
overhead. Thus, the forces that must be resolved involve ensuring faster response to
failure reports and faster recovery. Resolving these forces shortens the time needed to
attain stability and hence yields higher availability.

Solution → the Active Object pattern: An efficient way to optimize system recovery
and stabilization is to use the Active Object pattern. This pattern decouples method
execution from method invocation to enhance concurrency and to simplify synchronized 
access to an object that resides in its own thread. 

In DOORS, the invocation thread of the FaultDetector calls the
report_failure operation on the proxy object of the ReplicationManager
that is exposed to it. Figure 3 shows how the FaultDetector can call the
report_failure operation on the ReplicationManager proxy. This call is made
in the FaultDetector’s thread of control. The proxy then hands off the call to the
scheduler of the ReplicationManager, which enqueues this call and returns control
to the FaultDetector. The call is then dispatched to the ReplicationManager
servant, which executes this call in the ReplicationManager’s thread of control.
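
The proxy, scheduler, and servant roles can be sketched in a few lines. The class names below are hypothetical, a queue.Queue plays the part of the activation queue, and a single scheduler thread stands in for the ReplicationManager’s thread of control.

```python
import queue
import threading


class ReplicationManagerServant:
    """Executes recovery in the ReplicationManager's own thread."""

    def __init__(self):
        self.recovered = []

    def report_failure(self, object_id):
        self.recovered.append(object_id)


class ReplicationManagerProxy:
    """Enqueues calls and returns at once; a scheduler thread runs them."""

    _STOP = object()

    def __init__(self, servant):
        self._servant = servant
        self._requests = queue.Queue()
        self._scheduler = threading.Thread(target=self._dispatch)
        self._scheduler.start()

    def report_failure(self, object_id):
        self._requests.put(object_id)   # the caller is not blocked by recovery

    def shutdown(self):
        self._requests.put(self._STOP)
        self._scheduler.join()

    def _dispatch(self):
        while True:
            request = self._requests.get()
            if request is self._STOP:
                return
            self._servant.report_failure(request)
```

The FaultDetector side only ever touches the proxy, so its poller returns to polling as soon as the report has been enqueued.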

3.3 Caching Object References of FaultDetector in the ReplicationManager 

Context: The FT-CORBA standard does not specify how the ReplicationManager
informs the FaultDetectors to start monitoring the objects. A typical solution is to 
contact a CORBA Naming Service to obtain the FaultDetector object reference. 

Fig. 3. Applying the Active Object Pattern in DOORS

Problem: Making remote calls to the Naming Service to obtain a FaultDetector
object reference incurs a cost that affects the object group creation and recovery process. 
This cost becomes unnecessary when the FaultDetector remains unchanged, unless 
it has crashed and is rejuvenated. Hence, the force to be resolved involves minimizing 
the time spent in obtaining the object references of the FaultDetectors. 

Solution → Optimize for the common case by storing redundant information and
eliminating gratuitous waste: Unless the FaultDetector has itself crashed, there is 
no need for the ReplicationManager to obtain the object reference of the Fault- 
Detector each time it is needed. Instead, it can cache this information, which avoids 
the round-trip delay of invoking the Naming Service remotely. 

During initialization, the ReplicationManager obtains the FaultDetector’s object
reference and stores it in an internal table, as shown in Figure 4. The only time DOORS
must obtain a new object reference is when the FaultDetector crashes, which happens
infrequently in a properly configured system. This optimization can improve the time to
recovery and system stabilization significantly, thereby enhancing the performance and
availability of the application.

Fig. 4. Optimizing for the Common Case in DOORS
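
A sketch of the cached lookup follows. The class CachingReplicationManager is hypothetical, and the resolve callable passed to it stands in for a remote Naming Service lookup; no real CORBA API is used.

```python
class CachingReplicationManager:
    """Caches FaultDetector references obtained through `resolve`."""

    def __init__(self, resolve):
        self._resolve = resolve      # assumed to be the expensive remote call
        self._cache = {}

    def fault_detector(self, name):
        if name not in self._cache:  # uncommon case: first use or after a crash
            self._cache[name] = self._resolve(name)
        return self._cache[name]

    def invalidate(self, name):
        """Call when the FaultDetector is known to have crashed."""
        self._cache.pop(name, None)
```

Only the first request and the first request after invalidate pay for the remote call; every other request is served from the internal table.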

3.4 Support Interchangeable Behaviors 

Context: As explained earlier, the FT-CORBA standard specifies several properties,
such as replication styles and fault monitoring styles, and their values, which can
be set on a per-object group, per-type, or per-domain basis. In addition, the FT-CORBA 
standard provides operations to override these properties or to retrieve their values. The 
ReplicationManager that inherits the PropertyManager interface implements 
these operations. Moreover, efficient implementations are possible only when efficient
data structures are used to store and access these properties. The choice of data structures 
depends on (1) the number of properties supported by the ReplicationManager 
and (2) the maximum number of different types of object groups that are permitted. 

Problem: One way to implement FT-CORBA is to provide only static, non-extensible 
strategies that are hard-coded into the implementation. This design is inflexible, however, 
since components that want to use these options must (1) know of their existence, (2) 
understand their range of values, and (3) provide an appropriate implementation for each 
value. These restrictions make it hard to develop highly extensible services that can be 
composed transparently from configurable strategies. 

Solution → the Strategy pattern: An effective way to support multiple behaviors is
to apply the Strategy pattern. This pattern factors out similarities among algorithmic
alternatives and explicitly associates the name of a strategy with its algorithm and state. 

We are enhancing different components of DOORS, such as the ReplicationManager
and FaultDetector, to use the Strategy pattern. These enhancements enable developers
of FT-CORBA middleware to configure these components with implementations that are
customized for their requirements. Figure 5 illustrates how the Strategy pattern is
applied in DOORS. As shown in this figure, different replication styles can be configured
as strategies that are selectable by applications at run-time. Moreover, new strategies,
such as ACTIVE_WITH_VOTING, can be added without affecting existing strategies.

Fig. 5. Applying the Strategy Pattern in DOORS
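
As an illustration, two replication styles can be modeled as interchangeable strategy objects. The classes and the on_primary_failure method are hypothetical; the behavior of each style follows the descriptions of COLD_PASSIVE and WARM_PASSIVE given earlier.

```python
class ColdPassive:
    name = "COLD_PASSIVE"

    def on_primary_failure(self, backups):
        # No pre-spawned backups: a new primary is created on demand.
        return "spawn new primary"


class WarmPassive:
    name = "WARM_PASSIVE"

    def on_primary_failure(self, backups):
        # Promote a pre-spawned backup if one exists.
        return "promote " + backups[0] if backups else "spawn new primary"


class ObjectGroup:
    """Delegates failure handling to whichever strategy is configured."""

    def __init__(self, style, backups=()):
        self.style = style
        self.backups = list(backups)

    def primary_failed(self):
        return self.style.on_primary_failure(self.backups)
```

A further style can be added as one more class with the same method, and ObjectGroup needs no change, which is the extensibility the pattern is meant to provide.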

3.5 Consolidating Strategies 

Context: Section 3.4 describes how the Strategy pattern can be applied to configure
various requirements in the FT-CORBA service. There could be multiple strategies that
offer various features, such as fault monitoring style or membership style. It is important
to configure only semantically compatible strategies.

Problem: An undesirable side-effect from extensive use of the Strategy pattern in 
complex software is the maintenance problems posed by the possible semantic incom- 
patibilities between different strategies. For instance, the FT-CORBA service cannot be 
configured with the active replication style and the application-controlled membership style. In
general, the forces that must be resolved to compose all such strategies correctly involve 
(1) ensuring the configuration of semantically compatible strategies and (2) simplifying 
the management of a large number of individual strategies. 

Solution → the Abstract Factory pattern: An effective way to consolidate multiple
strategies into semantically compatible configurations is to apply the Abstract Factory
pattern. This pattern provides a single access point that integrates all strategies used
to configure the FT-CORBA middleware, such as DOORS. Concrete subclasses then 
aggregate compatible application-specific or domain-specific strategies, which can be 
replaced en masse in semantically meaningful ways. 

In the DOORS FT-CORBA implementation, abstract factories are used to encapsulate
internal data structure-specific strategies in components such as ReplicationManager
and FaultDetector. Figure 6 depicts how the property list in the ReplicationManager
uses abstract factories. The property abstract factory encapsulates the different property
strategies, such as the membership strategy, monitoring strategy, and replication strategy.
By using a property abstract factory, DOORS can be configured to have different property
sets conveniently and consistently.

Fig. 6. Applying the Abstract Factory Pattern in DOORS
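
A property abstract factory might look like the sketch below. The class names, dictionary keys, and string values are illustrative assumptions rather than identifiers from DOORS or the FT-CORBA standard.

```python
class PropertyFactory:
    """Abstract factory: one creation method per property strategy."""

    def replication_style(self):
        raise NotImplementedError

    def membership_style(self):
        raise NotImplementedError

    def monitoring_style(self):
        raise NotImplementedError

    def property_set(self):
        # Single access point: callers never mix strategies by hand.
        return {
            "replication": self.replication_style(),
            "membership": self.membership_style(),
            "monitoring": self.monitoring_style(),
        }


class ActiveInfrastructureFactory(PropertyFactory):
    """Active replication paired with infrastructure-controlled membership."""

    def replication_style(self):
        return "ACTIVE"

    def membership_style(self):
        return "INFRASTRUCTURE_CONTROLLED"

    def monitoring_style(self):
        return "PULL"


class WarmPassiveInfrastructureFactory(ActiveInfrastructureFactory):
    """Same membership and monitoring choices, different replication style."""

    def replication_style(self):
        return "WARM_PASSIVE"
```

Because each concrete factory fixes a whole combination, a pairing such as active replication with application-controlled membership can be ruled out simply by not providing a factory for it.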

3.6 Dynamically Configuring DOORS

Context: FT-CORBA implementations can benefit from the ability to extend their ser- 
vices dynamically, i.e., by allowing their strategies to be configured at run-time. The 
FT-CORBA standard allows applications to dynamically set certain fault tolerance prop- 
erties of the application’s replica group registered with the ReplicationManager. 
These properties include the list of factories that create each replica object of the replica 
group or the minimum number of replicas required to maintain the replica group size 
above a threshold. 

Problem: Although the Strategy and Abstract Factory patterns simplify the customiza- 
tion for specific applications, these patterns still require modifying, recompiling, and 
relinking the DOORS source code to enhance or add new strategies. Thus, the key force 
to resolve involves decoupling the behaviors of DOORS strategies from the time when 
they are actually configured into DOORS. 

Solution → the Component Configurator pattern: An effective way to enhance the
dynamism is to apply the Component Configurator pattern. This pattern employs
explicit dynamic linking mechanisms to obtain, install, and/or remove the run-time 
address bindings of custom Strategy and Abstract Factory objects into the service at 
installation-time and/or run-time. 

DOORS’s ReplicationManager and FaultDetector use the Component Configurator
pattern in conjunction with the Strategy and Abstract Factory patterns to dynamically
install the strategies they require without (1) recompiling or statically relinking
existing code, or (2) terminating and restarting an existing ReplicationManager or
FaultDetector. Applications can use this pattern to dynamically configure the
appropriate replication style, monitoring style, polling interval, and membership style
into the DOORS FT-CORBA service. Figure 7 shows how these properties are dynamically
linked. The use of the Component Configurator pattern allows the behavior of DOORS’s
ReplicationManager and FaultDetector to be customized for specific application
requirements without requiring access to, or modification of, the source code.
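
The idea of installing a strategy by name at run-time can be sketched with Python’s importlib, which here stands in for the explicit dynamic linking that DOORS performs with DLLs. The Configurator class and the "module:attribute" spec format are assumptions made for this illustration.

```python
import importlib


class Configurator:
    """Installs and removes strategies by name at run time."""

    def __init__(self):
        self._installed = {}

    def install(self, slot, spec):
        # spec has the assumed form "module:attribute".
        module_name, attr = spec.split(":")
        module = importlib.import_module(module_name)  # dynamic linking step
        self._installed[slot] = getattr(module, attr)

    def remove(self, slot):
        self._installed.pop(slot, None)

    def get(self, slot):
        return self._installed[slot]
```

For example, install("demo", "math:sqrt") binds a standard library function to a slot; in a real deployment the spec would name a module that ships a replication or monitoring strategy, and the hosting process would not need to be rebuilt or restarted.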




Fig. 7. Applying the Component Configurator Pattern in DOORS 

3.7 Efficient Property Name-Value Lookups

Context: The ReplicationManager of an FT-CORBA service is required to look up
the fault-tolerant properties of the object groups registered with it during object group 
creation and recovery. Properties are also located when an application retrieves them 
or overrides previous values. The FT-CORBA standard defines a hierarchical order in 
which properties must be found. First, the properties must be located for the object group 
that is the target of the request. If it is not found, then a lookup is made on a repository 
that holds properties for all object groups of the same type. If that lookup also fails, 
another lookup is performed on the domain-specific repository that acts as the default 
for all the object groups that are registered with the ReplicationManager. 

Problem: The hierarchical lookup ordering mandated by the FT-CORBA standard 
underscores the need for an efficient strategy to locate fault-tolerance properties. Thus, 
the force that must be resolved involves efficient lookups of fault-tolerance properties 
guided by the order specified in the FT-CORBA standard. 

Solution → the Chain of Responsibility pattern and perfect hashing: An efficient
way to perform hierarchical property lookups is to use the Chain of Responsibility
pattern, which decouples the sender of a request from its receiver, in conjunction
with perfect hashing to perform optimal name lookups. The Chain of Responsibility
pattern links the receiving objects and passes the request along the chain until an object 
handles the request. Perfect hashing is applicable because the number of properties 
supported by a ReplicationManager can be configured a priori. 

Since the fault-tolerance properties supported by a ReplicationManager are
determined a priori, the DOORS service uses a perfect hash function generated by GNU
gperf to perform an O(1) lookup on the property name. The Chain of Responsibility
pattern is applied by passing the request from one hash table to the other until the property
is found or the search fails, as illustrated in Figure 8.
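
The hierarchical lookup can be sketched as a chain of tables. Note the substitution: an ordinary Python dict replaces the gperf-generated perfect hash function, and the class name PropertyTable and the sample property names are illustrative assumptions.

```python
class PropertyTable:
    """One link in the chain: a hash table plus an optional fallback table."""

    def __init__(self, properties, fallback=None):
        self._properties = dict(properties)
        self._fallback = fallback

    def lookup(self, name):
        if name in self._properties:
            return self._properties[name]
        if self._fallback is not None:
            return self._fallback.lookup(name)  # pass the request along the chain
        raise KeyError(name)
```

Linking an object group table to a type table, and the type table to a domain table, reproduces the search order described in the context: group first, then type, then domain defaults.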




Fig. 8. Applying the Chain of Responsibility Pattern in DOORS 

4 Concluding Remarks

A growing number of CORBA applications with stringent performance requirements 
also require fault tolerance support. To address the fault tolerance requirements, the 
OMG recently standardized the Fault Tolerant CORBA (FT-CORBA) specification. The 
most flexible strategy for providing fault tolerance to CORBA applications is via higher- 
level CORBA services. 

To make FT-CORBA usable by performance-sensitive applications it must incur 
negligible overhead. To address these requirements, therefore, an FT-CORBA imple- 
mentation should possess the following properties: 

1. The fault detection and failovers incurred by servers should be transparent to clients.

2. Response time to the client should be bounded and predictable, irrespective of server 
failovers. 

3. The overhead incurred by the fault tolerance framework should maintain applica- 
tion performance requirements, such as efficiency and scalability, within designated 
bounds end-to-end. 

We have identified common pitfalls in FT-CORBA implementations that degrade 
performance. To eliminate these overheads, we are applying key design, architectural, 
and optimization principle patterns to improve the performance, extensibility, scalability, 
and robustness of the FT-CORBA implementation. 

Abstract. This paper describes the design, implementation, and performance
evaluation of a CORBA group membership protocol. Using
CORBA to implement a group membership protocol enables that proto- 
col to operate in a heterogeneous, distributed computing environment. To 
evaluate the effect of CORBA on the performance of a group membership 
protocol, this paper provides a detailed comparison of the performance 
measured from three implementations of a group membership protocol. 
One implementation uses UDP sockets, while the other two use CORBA 
for interprocess communication. The main conclusion is that CORBA 
can be used to implement high performance group membership proto- 
cols. There is some performance degradation due to CORBA, but this 
degradation can be reduced by carefully choosing an appropriate design. 



1 Introduction 

Group communication services have been proposed as mechanisms to construct 
high performance, highly available, dependable, and real-time applications.
At present, these services are mostly implemented on a homogeneous, distributed
computing environment. This is a major limitation. Our goal is to address this 
problem of operating group communication services in a heterogeneous, dis- 
tributed computing environment. In particular, we investigate the design, im- 
plementation, and performance of a group communication service implemented 
using the Common Object Request Broker Architecture (CORBA).

In this paper, we investigate the design, implementation, and performance 
evaluation of a group membership protocol in CORBA. In particular, we describe 
the implementation and performance of a group membership protocol called the 
three-round, majority agreement group membership service (TRM) using
CORBA. 

* This work was supported in part by an AFOSR grant F49620-98-1-0070.



M. Valero, V.K. Prasanna, and S. Vajapeyam (Eds.): HiPC 2000, LNCS 1970, pp. 121— 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000




S. Mishra and X. Lin 



Efforts in providing object replication support using a group communication
service in CORBA include the integration approach, the alternate protocol approach,
and the service approach. We have chosen a service approach in our
endeavor to design and implement a CORBA group membership protocol. In 
particular, we address the following questions: (1) how practical is it to imple- 
ment a group membership protocol using the service approach?, (2) what is the 
performance overhead of such an implementation?, (3) what are the sources of 
this performance overhead?, and (4) are the current CORBA specifications suffi- 
cient for implementing a group membership protocol using the service approach? 
We provide answers to these questions by describing three implementations of 
a group membership protocol called the three-round, majority agreement group 
membership protocol.

The first implementation uses the UDP socket interface, and runs in a homo- 
geneous, distributed computing environment, on a network of SGI workstations 
running IRIX 6.2. This implementation follows the standard techniques that 
have been used in the past to implement group membership protocols. We will 
refer to this implementation as the socket implementation. The second and the 
third implementations use CORBA and run on a network of SGI workstations 
running IRIX 6.2, Sun Sparcstations running Solaris, PCs running Windows 95, 
and PCs running Windows NT 4.0. We will refer to these two implementations as 
the CORBA implementations. The second implementation uses IONA Orbix 2.3, 
while the third implementation uses IONA Orbix 2.3 and UDP socket interface. 
We will refer to the second implementation as the pure CORBA implementation, 
and the third implementation as the hybrid CORBA implementation. 

We compare the performance measured from the three implementations, identify the
sources of performance overhead due to CORBA, and discuss some techniques to improve
the performance of a CORBA group membership protocol. The main conclusion of this
paper is that the current CORBA technology is suitable for implementing a high
performance group membership protocol in a heterogeneous, distributed computing
environment using the service approach. While there is some performance overhead due
to CORBA, it is significantly lower than the performance overhead we observed
in a CORBA atomic broadcast protocol. Furthermore, this performance overhead can be
reduced to a certain extent by carefully choosing an appropriate design.



2 Group Communication Service 

Figure 1 shows the relationship between application clients, application servers,
and a group communication service. An application with high availability, de- 
pendability, and/or real-time responsiveness requirements is constructed by im- 
plementing application servers on multiple machines. These application servers 
replicate the application state and use a group communication service to coordi- 
nate their activities in the presence of concurrent event occurrences, asynchrony, 
and processor or communication failures. Application clients that need application
services interact with one of the application servers.




Fig. 1. Relationship between clients, servers, and group communication service. 



We have chosen a group membership protocol called the three-round, majority
agreement group membership protocol (TRM) for implementation using CORBA. TRM is
one of five group membership protocols presented in earlier work. This protocol’s
high performance and simple design make it an ideal candidate for real-world
implementations. In this section, we describe this protocol briefly. A detailed
description appears in the work that introduced the protocol.

As the name implies, TRM consists of three rounds of message exchanges. In 
the beginning, a processor creates a new group and joins that group. After that, 
it starts sending probe messages to all processors. When a processor p receives a 
probe message from another processor whose ID is less than its own, it creates 
a new group and sends a new group invitation message to all other processors. 
Processor p is termed the group creator. Receivers of this message reply with
an acceptance message if they want to join the new group. When the group creator
receives acceptance messages from other processors, it includes them as members
in the new group and sends back a join message to inform them of the successful join.
If a processor receives more than one new group invitation message, it responds
to the one sent by the processor with the greater ID number. Eventually, the processor
with the greatest ID among a set of processors that can communicate with one
another forms a group that reaches a stable state. This stability is achieved by
the leader of the group (the member with the smallest ID) periodically sending an I am
alive message to its right neighbor, which in turn passes that message to its right
neighbor, and so on. When the I am alive message gets back to the leader, the
group is stable. In addition to the I am alive message, the group leader also
sends probe messages to all processors that are not currently in the group.

In an asynchronous distributed system, there is no bound on communication
delays. In TRM, a processor takes appropriate actions based on messages received
within a fixed time interval. If a member doesn’t receive an I am alive message
from its left neighbor within a prescribed period, it declares a group failure and
starts its own group. Similarly, if the group creator doesn’t receive an acceptance
message from certain processors after sending a new group invitation message,
it simply omits those processors from the group and sends join messages to
only those processors that responded in time. On the other hand, if a processor
responds to the group creator by sending an acceptance message, but doesn’t
receive a join message in time, it creates a new group.
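
Two of the rules above lend themselves to a short sketch: the creator admits only processors whose acceptance arrives in time, and a processor holding several invitations answers the creator with the greatest ID. The function names and the representation of arrival times as seconds are assumptions for illustration.

```python
def form_group(creator_id, acceptances, deadline):
    """Return the new group: the creator plus every processor whose
    acceptance arrived by `deadline` (arrival times in seconds)."""
    on_time = [pid for pid, arrived in acceptances.items() if arrived <= deadline]
    return sorted(on_time + [creator_id])


def choose_invitation(invitations):
    """Given the IDs of several inviting creators, answer the greatest ID."""
    return max(invitations)
```

Processors omitted from the group by form_group would, under the protocol, time out waiting for a join message and create their own group.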

3 Implementation 





Fig. 2. Socket Implementation. 



We have implemented the three-round, majority agreement group membership
protocol in three ways. The first implementation, called the socket implementation,
uses the UDP socket interface for communication. Each member is implemented
by a single process that sends, receives, and processes messages, and
maintains the protocol state. Figure 2 shows an outline of this implementation.

The next two implementations, called the CORBA implementations, use
CORBA. These implementations use two important properties of CORBA: standardized
ORB and remote invocation. In accordance with the relationship between
application clients, application servers, and a group communication service
(see Figure 1), each group member provides two sets of IDL interfaces:
members_interface and application_interface. The members_interface specifies
interactions between different group members, and the application_interface specifies
interactions between an application server and a group member. Together, these
two interfaces allow different application servers and different group members
to run on machines that have different architectures and run different operating
systems. In addition, these interfaces allow the implementation of application
servers and group members in different programming languages.

Group members running on different hosts communicate with one another
via the ORB by using the members_interface. This interface allows different
group members to run on machines of different architectures and to use different
programming languages for their implementation. Figure 3 shows the usefulness
of this interface. The ORB makes remote invocation in CORBA as simple as
a local function call. In our design, message transfer between different group
members is accomplished by passing messages as parameters of the remote
invocation.




Fig. 3. Heterogeneity via the members and application interfaces. 



An application server invokes various group communication service
operations by using the application_interface. This allows application servers and
group members to run on different machines of different architectures. In
addition, this interface also enables an application server and a group member to be
implemented in different programming languages. Figure 3 shows the usefulness
of this interface.

The CORBA implementations use the Basic Object Adapter as opposed to the
Portable Object Adapter of the ORB. Each processor looks for other
processors using their ID numbers. CORBA expresses network failures with predefined
exception classes. A client catches these exceptions and reports them to the
processor (member) as communication failures. Failures can occur both at server
lookup and at server method invocation.

The two CORBA implementations differ from each other in the way a
member is implemented. In the pure CORBA implementation, there is no clearly
defined role for server and client. Each group member implements both a server
and a client. At the start, a member creates an object for its
implementation. At initialization, this object exports the newly created object
to the ORB by using the instruction boa.obj_is_ready(this). It then acts as a server by
invoking boa.impl_is_ready(). When a group member needs to send messages to the
other group members by invoking functions exported by the other members, it
acts as a client. So, in the pure CORBA implementation, server code and client
code are implemented in the same program. This implementation is similar to
the member implementation in the socket implementation. Figure 4 shows this
implementation for a group of size three. Each member is implemented by a
single program that is responsible for sending messages, receiving messages, and
processing messages.

A consequence of incorporating server and client code in the same program
is that threads must be created to perform various activities.
In CORBA, an invocation from a client to a server function terminates only
when either the server responds or an exception occurs. If the main thread that
responds to the clients' invocations (acting as a server) invokes another member's
function (acting as a client), it may block for a significant period of time and
will not be responsive to the client invocations. To avoid this, the main thread
creates new threads to carry out each client duty.
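
The threading pattern described above can be sketched as follows. This is only an illustration in Python, not the C++/Orbix code of the implementation: blocking_remote_call is a hypothetical stand-in for a CORBA invocation, and the 50 msec sleep merely simulates a round trip.

```python
import threading
import time


def blocking_remote_call(peer, message, log):
    """Stand-in for a remote invocation that returns only when the peer replies."""
    time.sleep(0.05)             # simulated round trip
    log.append((peer, message))  # record the completed call


def handle_incoming(message, peers, log):
    """Server-side handler: hand each outgoing call to its own thread and
    return at once, so the main thread stays responsive to other clients."""
    workers = [
        threading.Thread(target=blocking_remote_call, args=(peer, message, log))
        for peer in peers
    ]
    for worker in workers:
        worker.start()
    return workers
```

The caller may later join the returned threads; because several of them can run at the same time, any shared protocol state they touch needs synchronization, which is the overhead discussed in Section 4.3.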

In the hybrid CORBA implementation, each group member is implemented 
in two parts: one implements the server functions of a member and the other 
implements the client functions. The two parts are implemented by two separate 
processes called the server process and the client process. Since both processes 
run on the same processor, any form of interprocess communication provided 
by the underlying operating system may be used for communication between 
them. For example, in the Unix operating system, pipes or UDP sockets may 
be used. We have used UDP sockets in our implementation. Figure 5 shows this
implementation for a group of size three. 




Fig. 4. Pure CORBA

Fig. 5. Hybrid CORBA



The client process implements the application_interface and the server
process implements the members_interface. The server process is responsible for
receiving messages from other group members. As soon as a message is received,
it passes that message to the client process. The client process is responsible for
implementing all group membership functionality. In particular, it processes
messages sent by other members, services application server requests, implements
atomicity, order, and termination semantics, maintains a consistent group
membership, and so on. This implementation of a group member by two processes
avoids the need for creating separate threads for various client activities.
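
The hand-off from the server process to the client process can be sketched with a local UDP socket pair. This is a minimal illustration under stated assumptions, not the actual implementation: the function names are hypothetical, both endpoints are assumed to live on the loopback interface, and the message is an opaque byte string.

```python
import socket


def open_client_endpoint():
    """Client process side: bind a UDP socket on the loopback interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
    sock.settimeout(2.0)         # do not wait forever for a hand-off
    return sock


def forward_to_client(message: bytes, client_addr):
    """Server process side: pass a message received via the ORB to the client process."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as out:
        out.sendto(message, client_addr)
```

In a real deployment the two functions would run in separate processes on the same host; the server process would call forward_to_client from inside the method invoked through the members_interface.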

4 Performance Evaluation 

The implementation of the three-round, majority agreement group membership
protocol using CORBA consists of about 3,000 lines of C++ code, not including
the three automatically generated C++ source and header files from the IDL
compiler. For CORBA, we used Orbix 2.3 from Iona Technologies [?]. This
implementation runs on a network of SGI workstations running IRIX 6.2, Sun
Sparcstations running Solaris, PCs running Windows 95, and PCs running
Windows NT 4.0. To evaluate the effect of using CORBA on the performance of a
group membership protocol, we measured its performance on a network of SGI
workstations, because the socket implementation runs on a
network of SGI workstations.



4.1 Performance Indices 

There are three performance indices we have measured to evaluate the effect 
of CORBA on the performance of the three-round, majority agreement group 
membership protocol: initialization stabilization time, failure stabilization time,
and recovery stabilization time. The initialization stabilization time measures the
time to initialize the system. It is the time interval between the moment the last
processor is started and the moment the group is stable, i.e., when the new group
leader receives an I am alive message for the first time. The failure stabilization time
measures the time needed to construct a new group after a failure notification 
of a group member is first received. It is the time interval between the moment 
the failed member’s right neighbor determines that its left neighbor has failed 
and the moment the new group is stable. Finally, the recovery stabilization time 
measures the time needed to incorporate a newly recovered processor in the 
group. It is the time interval between the moment the recovered processor is 
started and the moment the next group is stable. 



Table 1. Membership Protocol Performance

4.2 Performance

The performance of the three-round, majority agreement group membership protocol
depends on the values of three constants: δ, π, and μ. δ is a time interval
within which any message sent by a processor will most likely be received by the
intended receiver. Each processor sends an ‘I am alive’ message to its right neighbor
at regular intervals of π time units, and μ denotes the period with which
‘probe’ messages are sent by a group member to incorporate any newly recovered
processor in the group.

We have adopted the suggestions given in [?] to construct timeout intervals
based on the values of these three constants: SendProbeTimer = μ,
SendIamAliveTimer = π, MissingJoinTimer = 3*δ, MissingAcceptTimer =
2*δ, and MissingNeighborTimer = π + rank() * δ, where rank() denotes
the position of a group member in the group. We have measured the
performance indices of the three implementations for three different values of
δ: 1000 msec, 500 msec, and 200 msec. The values of μ and π were fixed at
1000 msec in all experiments. Table 1 shows the values of the three performance
indices measured.
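
As a worked illustration, the timeout intervals can be computed from the three constants as sketched below. The parameter names delta_ms, pi_ms, and mu_ms are our own labels for the three constants, and the MissingNeighborTimer formula follows our reading of the (partly garbled) source text, so it should be treated as an assumption.

```python
def timeouts(delta_ms, pi_ms, mu_ms, rank):
    """Return the five timer values (in msec) for a member at position `rank`."""
    return {
        "SendProbeTimer": mu_ms,
        "SendIamAliveTimer": pi_ms,
        "MissingJoinTimer": 3 * delta_ms,
        "MissingAcceptTimer": 2 * delta_ms,
        # Assumed reading of the source: the period plus rank() times the delay bound.
        "MissingNeighborTimer": pi_ms + rank * delta_ms,
    }
```

With the smallest delay bound used in the experiments (200 msec) and both periods fixed at 1000 msec, a member at rank 2 would wait 400 msec for acceptances and 1400 msec for its neighbor.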



4.3 Performance Analysis 

It is clear from these measurements that the socket implementation provides the
best performance, followed by the hybrid CORBA and pure CORBA
implementations. In general, only the failure stabilization time changes with a change in
the δ value in the three implementations. The failure stabilization time decreases
with a reduction in the δ value. This is because a lower value of δ implies a lower
value of MissingAcceptTimer. As a result, the leader of the new group waits
for a shorter duration before sending the I am alive message. The initialization and
recovery stabilization times are not affected significantly by a change in the δ
value, because all processors most likely respond before the timer expires during
initialization and recovery periods.

The initialization and recovery stabilization times in the pure CORBA im- 
plementation are about two times larger than the same times in the socket 
implementation, while they are about 1.5 times larger in the hybrid CORBA 
implementation compared to the same times in the socket implementation. The 
failure stabilization times in the hybrid CORBA implementation are only slightly 
larger than those in the socket implementation, while they are significantly larger 
in the pure CORBA implementation. 

The only difference between the CORBA implementations and the socket im- 
plementation of the three round, majority agreement group membership protocol 
is the method used for interprocess communication. CORBA implementations 
use the CORBA remote object invocation via OrbixORB, while the socket im- 
plementation uses UDP sockets. So, in order to understand the reasons for extra 
performance overhead in the CORBA implementations, we measured the average 
one-way communication delay between two SGI workstations for UDP sockets 
and OrbixORB. 
For a message size of 100 bytes (the approximate size of the three-round,
majority agreement group membership protocol messages), the average one-way
communication delay was measured to be about 0.6 milliseconds for UDP
sockets and 2.5 milliseconds for OrbixORB. This indicates that the communication
delay in UDP is roughly four times smaller than that in OrbixORB. As we
increased the message size, this difference decreased. However, since messages in
the three-round, majority agreement group membership protocol are small
(around 100 bytes), the one-way communication delay in the CORBA implementations
is expected to be about four times that in the socket implementation.

The extra performance overhead in all three performance indices in the 
CORBA implementations is due to this difference in one-way communication 
delay. An important point to note here is that the performance degradation due
to CORBA in the three-round, majority agreement group membership protocol
is not as high as it was in the atomic broadcast protocol [?]. The main reason
for this is that the performance of a group membership protocol is significantly 
affected by the values of various timers that the protocol uses. Generally, the 
values of these timers are much larger than the one-way communication delays of 
the network. For example, the one-way communication delay in our experiments 
was about 2.5 msec, while the values of various timers were more than 300 msec. 
As a result, the effect of the difference in one-way communication delays on the 
performance difference between different implementations is not significant. 
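
The arithmetic behind this argument can be made explicit with a short sketch. The function names are hypothetical; the figures plugged in afterwards (0.6 msec, 2.5 msec, and a 300 msec timer) are the ones quoted in the text.

```python
def slowdown(orb_delay_ms, udp_delay_ms):
    """How many times slower one ORB message is than one UDP message."""
    return orb_delay_ms / udp_delay_ms


def share_of_timer(delay_ms, timer_ms):
    """Fraction of a timeout interval taken up by a single one-way delay."""
    return delay_ms / timer_ms
```

Here slowdown(2.5, 0.6) is a little over 4, while share_of_timer(2.5, 300.0) is below 1%, which is consistent with stabilization times that differ by a factor of 1.5 to 2 rather than 4.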

There are two reasons for the poor performance of the pure CORBA
implementation compared to the hybrid CORBA implementation. The first reason
is that there is extra overhead in the pure CORBA implementation due to
thread creation and synchronization. As mentioned in Section 3, a new
thread is created for every client invocation to ensure that these invocations are
serviced promptly. Some of these threads may run in parallel, which gives rise to
a need for thread synchronization. This thread creation and synchronization is
avoided in the hybrid CORBA implementation by a clean separation of the client 
and the server functionalities. The second reason for the poor performance of the 
pure CORBA implementation compared to the hybrid CORBA implementation 
arises due to a need for name resolution. In both CORBA implementations, ID 
numbers are used to refer to a processor. So, a name resolution routine needs 
to be executed whenever a reference to a processor is not available. This over- 
head is smaller in the hybrid CORBA implementation than in the pure CORBA 
implementation because this name resolution needs to be done in the hybrid 
implementation only when the server process starts up. References to the server 
processes are always available to the client process, even during failure and re- 
covery. This is not the case in the pure CORBA implementation. 



5 Conclusion 

We have implemented a CORBA group membership protocol using the service
approach and measured the performance overhead due to CORBA. Our goal
was to find out the extent of this performance overhead. The performance
degradation in the group membership protocol due to CORBA was observed to be
between 1.5 and 2 times.

In addition to determining the extent of performance degradation due to 
CORBA in group membership protocols, we have proposed two design tech- 
niques using the service approach to implement group membership protocols. 
One technique uses only the CORBA ORB for interprocess communication, while 
the other uses CORBA ORB and UDP for interprocess communication. We ob- 
served that the second technique results in improving the protocol performance 
to some extent, because it avoids a need for thread creation or synchronization, 
and reduces the number of name resolutions performed. 

The main conclusion we can draw from this work is that the current CORBA
technology is suitable for implementing a high-performance group membership
protocol in a heterogeneous, distributed computing environment using the
service approach. While there is some performance overhead due to CORBA, it
is significantly lower than the performance overhead we observed in a CORBA
atomic broadcast protocol [?]. Furthermore, appropriate design can lead to
further performance improvement in a CORBA group membership protocol.

Abstract. The pervasive presence of portable electronic devices and the
massive adoption of Internet technology are changing the shape of modern
computing systems. Applications are being broken down into smaller
components that can be modified independently and that can be plugged in and
out dynamically. All this drives the development of flexible, highly scalable
middleware. An interesting category of such middleware supports the
event-based paradigm. In this paper we focus on the scalability issues we have faced
in the development of an event-based middleware called JEDI. In particular, we
focus on improving the performance of such middleware when the number and
distribution of components grow. In order to evaluate design alternatives, we
have taken a simulation-based approach that has allowed us to analyze the design
alternatives before actually implementing them.



1 Introduction 

Modern computing systems are more and more oriented to serve scenarios in which 
any device is provided with a computational capability and is able to interact with the 
other devices it gets close to [7]. In this context, the need for scalable middleware that 
supports such interaction is growing. Such middleware has to allow easy
reconfiguration and plugging in of new components. Moreover, it has to enable
anonymous and multicast communication in order to support scenarios where
components do not know what other components are around and which of them
would be interested in their messages.

The kind of middleware that at the moment seems to address these requirements is
based on the event-based approach, where applications are structured in autonomous
components that communicate by generating and receiving event notifications. A
component usually generates an event notification when it wants to let the “external
world” know that some relevant event occurred in its internal state¹. The event
dispatcher provided by the middleware propagates the event to any component that
has declared its interest by issuing a subscription. A subscription, therefore, can be
seen as a constraint on the content of the events; when an event respects the condition
stated by a subscription we say that the event is compatible with that subscription.
The propagation of events is completely hidden from the components that generated
them, thus the event dispatcher implements a multicasting mechanism that fully
decouples event generators from event receivers. This has two important effects.
First, a component can operate in the system without being aware of the existence,
number and location of other components. Second, it is always possible to plug a
component in and out of the architecture without affecting the other components
directly. These two effects guarantee a high compositionality and reconfigurability of
a software system, and make event-based middleware particularly suited for the
development of systems in which components operate autonomously and are loosely
coupled. The underlying hypothesis is that an agreement exists between components
on the structure of events. Several event-based platforms are currently available,
either as research prototypes or as commercial products. [4] and [5] present the JEDI
system and a taxonomy of some of the other existing platforms.

¹ In the following we will use the terms event notification and event interchangeably since, from
the viewpoint of event-based middleware, we do not need to distinguish between the
occurrence of an event and the generation of the corresponding notification.

A major problem for all such systems is scalability and performance. Indeed,
if the dispatcher has to manage every component subscription and
notification, it can easily become a bottleneck for the whole system. To solve this
problem, in JEDI the event dispatcher is implemented as a distributed system
composed of several dispatching servers organized in a hierarchy. Each of these
servers shares with the others part of the received subscriptions and events in order to
guarantee that connected components communicate properly.

In order to evaluate the performance and scalability of our solution and to identify
possible improvements, we have taken a simulation-based approach. This allows us to
determine the best alternative before actually implementing and deploying any
possible solution. In this paper we compare through simulation the current
implementation of JEDI with an alternative design, and we discuss advantages and
disadvantages of both approaches. The rest of the paper is structured as follows.
Section 2 presents an overview of the event-based middleware we are developing and
of the alternative design we are considering. Section 3 describes the simulation model
of JEDI and Section 4 presents the results gathered from simulation. Section 5 
discusses the related work and, finally, Section 6 provides some conclusions. 



2 JEDI 

JEDI stands for Java Event-based Distributed Infrastructure. It supports asynchronous 
and decoupled communication among distributed elements that are called active 
objects (AOs for short). Each active object interacts with other AOs by explicitly 
producing and consuming events. Event notifications in JEDI have a name and a 
number of parameters. For instance, SoftwareReleased(Editor, 1.3, WinNT) notifies 
that version 1.3 of a software called Editor has been released for WindowsNT. 
Subscriptions are syntactically similar to notifications; they have a name and some 
parameters. A subscription is compatible with every event that has the same values 
for the same fields. To allow the creation of more flexible subscriptions the operator 
“*” has been introduced, representing a kind of wildcard. For instance, subscription 
*(Editor, *, Win*) is compatible with all the notifications (including the one above) 
having any name, three parameters, and concerning all the Editor versions that run on 
WindowsNT, Windows98, Windows2000, ... 
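
The compatibility rule can be illustrated with a small sketch. It is not JEDI's Java implementation: parse and compatible are hypothetical helpers, events and subscriptions are assumed to be written as strings of the form Name(p1, p2, ...), and the wildcard is delegated to Python's fnmatchcase, which also gives special meaning to "?" and "[...]" (an extension the paper does not describe).

```python
from fnmatch import fnmatchcase


def parse(text):
    """Split 'Name(p1, p2, ...)' into (name, [p1, p2, ...])."""
    name, _, rest = text.partition("(")
    inner = rest.rstrip(")")
    params = [p.strip() for p in inner.split(",")] if inner else []
    return name.strip(), params


def compatible(subscription, event):
    """True if the event has as many parameters as the subscription and every
    field of the event matches the corresponding (possibly wildcard) field."""
    s_name, s_params = parse(subscription)
    e_name, e_params = parse(event)
    if len(s_params) != len(e_params):
        return False
    return fnmatchcase(e_name, s_name) and all(
        fnmatchcase(e, s) for s, e in zip(s_params, e_params)
    )
```

With these helpers, the subscription *(Editor, *, Win*) is compatible with SoftwareReleased(Editor, 1.3, WinNT) but not with a notification whose third parameter is Linux.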

2.1 Subscriptions and Event Propagation

In JEDI subscriptions and notifications are managed by a hierarchy of dispatching 
servers (DS). Whenever an AO issues a subscription, the DS connected to that AO
stores the subscription in its internal tables and forwards it to its parent, which, in
turn, repeats the procedure until the subscription arrives at the root DS. When a
notification is generated, the receiving dispatching server forwards it to all its AOs 
and descendant dispatching servers that previously sent compatible subscriptions. 
Moreover, the dispatching server sends the notification to its parent that, in turn, acts 
in a similar way (without sending the notification back). Therefore, a notification can 
reach each AO that has issued a compatible subscription, regardless of the AO 
position in the hierarchy. The system guarantees that causally related notifications are 
received by AOs in the same order in which they are generated. 
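
The propagation scheme just described can be sketched as follows. This is an illustrative model, not JEDI code: the class and attribute names are hypothetical, events and subscriptions are plain strings matched with fnmatchcase, and ordering guarantees are ignored.

```python
from fnmatch import fnmatchcase


class ActiveObject:
    def __init__(self, name):
        self.name = name
        self.received = []

    def notify(self, event, sender):
        self.received.append(event)


class DispatchingServer:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.table = {}  # neighbor (AO or child DS) -> subscription patterns
        self.seen = []   # every event this server had to handle

    def subscribe(self, neighbor, pattern):
        """Store the subscription, then forward it toward the root."""
        self.table.setdefault(neighbor, []).append(pattern)
        if self.parent is not None:
            self.parent.subscribe(self, pattern)

    def notify(self, event, sender):
        """Forward the event to interested neighbors (never back to the sender)
        and, unless it came from the parent, up to the parent as well."""
        self.seen.append(event)
        for neighbor, patterns in self.table.items():
            if neighbor is not sender and any(fnmatchcase(event, p) for p in patterns):
                neighbor.notify(event, self)
        if self.parent is not None and sender is not self.parent:
            self.parent.notify(event, self)
```

In this model every event climbs to the root even when no subscriber lies beyond it, which is the cost that Section 2.2 sets out to reduce.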

The hierarchical approach is certainly more scalable than a centralized one. However,
it requires DSs to coordinate by propagating subscriptions and notifications.
Such coordination traffic has to be kept as limited as possible in order to ensure a good
level of performance. In the next section we propose an improvement of the approach
presented above.



2.2 Improving the Event Propagation Algorithm: Advertisements 

As discussed in the previous section, in the current version of JEDI, events emitted by 
AOs always reach the root of the dispatching hierarchy even if there is no subscriber 
that can be reached through the root DS. So, when the number of events emitted per
unit of time grows, the DSs located at the higher levels of the hierarchy may become
overloaded. To relieve this situation we introduce a new event propagation algorithm
based on the hypothesis that AOs perform a new operation called advertisement. AOs
use advertisement to specify which kind of events they will produce. The rules that 
define the matching between events and advertisements are the same holding between 
events and subscriptions (see previous section). Moreover, we say that an 
advertisement is compatible with a subscription if at least one event compatible with 
both of them exists. AOs can dynamically issue advertisements and withdraw them 
during their life cycle. 

Each DS uses advertisements and subscriptions to create proper routing tables that 
cause events to be propagated only through paths leading to interested subscribers. In 
the new algorithm we propose, advertisements are routed toward the root of the 
dispatching hierarchy exactly as it was illustrated for subscriptions in the previous 
section. When receiving an advertisement, a DS looks for compatible subscriptions 
that are then forwarded to the sub-tree that has generated the advertisement. 

As an example, let us consider the hierarchy shown in Fig. 1 where hexagons
represent dispatching servers, while circles represent AOs. Let us suppose that AO β
issues a subscription. According to the algorithm previously described, this
subscription is propagated to dispatchers D, B, and A. Now suppose that α issues an
advertisement compatible with the subscription of β: F receives this advertisement
and forwards it to B, which detects the compatible subscription. Hence B propagates this
subscription to F (notice that the subscription is now stored on every DS that connects
α to β). Finally, B propagates the advertisement to A, which checks whether the sub-tree rooted
at C has communicated other compatible subscriptions. If not, it simply stores the
advertisement in its internal tables. Thanks to this subscription propagation approach,
all the events generated by α are routed to β through dispatcher B, and do not reach A
unless a compatible subscription has been issued by an AO connected to A or to the
subtree rooted at C.




Fig. 1. Connections between dispatching servers and agents 

Of course, new subscriptions compatible with an advertisement can be issued even 
after an advertisement has been propagated (or during the propagation). A dispatcher 
that receives a subscription, while routing it toward the root of the hierarchy, checks 
in its internal tables for compatible advertisements, and, if any, routes the subscription 
toward the senders of these advertisements. 

Summarizing, advertisements and subscriptions are forwarded up to the root. In 
addition, subscriptions are sent toward advertisements so that they can be used to 
create virtual paths between the event producers and consumers. 
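
The advertisement-based routing summarized above can be sketched as follows. This is an illustrative model under several assumptions, not the algorithm as implemented or simulated by the authors: class and method names are hypothetical, patterns are plain strings matched with fnmatchcase, the overlaps test is only a crude approximation of advertisement/subscription compatibility, and withdrawal of advertisements is omitted.

```python
from fnmatch import fnmatchcase


def overlaps(adv, sub):
    """Crude stand-in (assumption) for 'some event is compatible with both'."""
    return fnmatchcase(adv, sub) or fnmatchcase(sub, adv)


class AO:
    def __init__(self, name):
        self.name, self.received = name, []

    def notify(self, event, sender):
        self.received.append(event)


class DS:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.subs = {}  # neighbor -> subscription patterns received from it
        self.advs = {}  # neighbor -> advertisement patterns received from it
        self.seen = []  # events this server had to handle

    def subscribe(self, sender, sub):
        self.subs.setdefault(sender, []).append(sub)
        # Route the subscription toward other sub-trees that advertised compatibly.
        for neighbor, advs in self.advs.items():
            is_other_ds = neighbor is not sender and isinstance(neighbor, DS)
            if is_other_ds and any(overlaps(a, sub) for a in advs):
                neighbor.subscribe(self, sub)
        # Subscriptions arriving from below also continue toward the root.
        if self.parent is not None and sender is not self.parent:
            self.parent.subscribe(self, sub)

    def advertise(self, sender, adv):
        self.advs.setdefault(sender, []).append(adv)
        # Push already-known compatible subscriptions down to the advertising sub-tree.
        if isinstance(sender, DS):
            for neighbor, subs in list(self.subs.items()):
                if neighbor is not sender:
                    for sub in subs:
                        if overlaps(adv, sub):
                            sender.subscribe(self, sub)
        if self.parent is not None:
            self.parent.advertise(self, adv)

    def notify(self, event, sender):
        """Forward only along paths on which a compatible subscription is stored."""
        self.seen.append(event)
        for neighbor, subs in self.subs.items():
            if neighbor is not sender and any(fnmatchcase(event, s) for s in subs):
                neighbor.notify(event, self)
```

Replaying the example of Fig. 1 with this model, events from the advertiser reach the subscriber through B without touching the root; the root only becomes involved once an AO under C issues a compatible subscription.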

To further limit the traffic caused by propagation of subscriptions and 
advertisements, we have exploited some optimization techniques introduced in [2]
and [3]. In these papers it is shown that, given a pair of advertisements (or
subscriptions) x and y, it may happen that all the events compatible with y are
compatible with x too. In this case we say that x covers y. Whenever an advertisement
(subscription), issued in a subtree of the dispatching hierarchy, is covered by another 
advertisement (subscription) issued in the same subtree, such advertisement 
(subscription) does not need to be propagated toward the root since it does not 
provide any additional information. 
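
The covering test can be sketched under a simplifying assumption: each subscription (or advertisement) is a tuple of fields, and a field is either a literal, a lone "*", or a prefix followed by one trailing "*". General wildcard placement is not handled, and the function names are hypothetical.

```python
def field_covers(x, y):
    """True if every value matched by field pattern y is also matched by x."""
    if x == "*" or x == y:
        return True
    return x.endswith("*") and y.startswith(x[:-1])


def covers(x, y):
    """x covers y if both have the same number of fields and x covers y field by field."""
    return len(x) == len(y) and all(field_covers(a, b) for a, b in zip(x, y))


def prune(patterns):
    """Keep only the patterns that still need to be propagated toward the root."""
    kept = []
    for p in patterns:
        if any(covers(k, p) for k in kept):
            continue  # p adds no information
        kept = [k for k in kept if not covers(p, k)]
        kept.append(p)
    return kept
```

For instance, within one subtree the subscription ("*", "Editor", "*", "Win*") covers ("SoftwareReleased", "Editor", "1.3", "WinNT"), so only the former has to travel toward the root.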



3 The Simulation Model 

In order to define a simulation model that properly describes the interesting 
characteristics of JEDI, we have collected numeric data on the behavior of two 
existing applications that have been developed on top of JEDI in the past years. The 
analysis of such mass of data has allowed us to identify the main components of the 
simulation model and their principal characteristics. 

The main components of the simulation model describe dispatching servers and
active objects. Dispatching servers are described in terms of the operations they can
perform (i.e., managing subscriptions, advertisements, notifications, ...) and the
distribution of the time spent in performing these operations, as
determined from the measured data. Since the advertisement mechanism has not been
developed so far in JEDI, the time needed to process advertisements has been
estimated by considering that the algorithm that manages them is similar to the one
used for subscriptions and exploits similar data structures.

Regardless of the application-dependent task they perform, AOs interact with the 
event-based middleware to issue subscriptions and advertisements, and to generate 
events. Based on this, for the purpose of our model, we have identified four 
elementary behaviors for AOs and we have associated them to proper simulation 
components called agents. Based on the behavior they embody, agents can be of four 
different types: sinks, sources, proactive and reactive agents. A source represents an 
AO that can only send events (and advertisements) during its life. A sink is able only 
to receive event notifications (and to send subscriptions). Proactive agents take the 
initiative by generating events and then wait for some reply events. Reactive agents 
wait for events and then generate new events. 

In event-based middleware, inputs to the dispatching hierarchy (notifications, 
advertisements, and subscriptions) are correlated through the compatibility and 
covering relations defined in Section 2. This correlation captures the characteristics of 
the applications built on top of the middleware and it has an impact on the load of 
dispatching servers and on the network traffic. We have created a synthetic 
correlation model that, at simulation startup, automatically generates subscriptions, 
advertisements, and notifications to be assigned to each agent. This correlation model 
is based on a few parameters that describe various application-dependent phenomena
such as the distance covered by notifications in order to reach their subscribers, the 
coverage relationships defined on advertisements and subscriptions, etc. 

The main parameter we have introduced is the spreading coefficient (sc). It 
provides an indication of the distance covered by events in the dispatching hierarchy. 
This coefficient is defined in the [0, 1] interval and occurs in the following formula: 

$R(n) = sc^{n}$        (1)

R(n) is the probability that an event issued by an agent directly connected to a 
dispatcher A has to be received by some agent directly connected to a dispatcher at a 
distance n from A. The distance between dispatchers is calculated on the basis of the 
topology of the dispatching hierarchy. For instance, in the topology of Fig. 1 
dispatchers A and D are at a distance 2. Intuitively, the closer sc (spreading
coefficient) is to 1, the more likely it is that events have to be spread across the
whole hierarchy of DSs. Conversely, when sc is low, events and corresponding 
subscriptions are mostly localized in some dispatching hierarchy sub-tree. During 
simulation, we use the spreading coefficient as an input parameter, in order to test the 
behavior of the event-based system in different event propagation conditions. To 
capture the effect of covering relations for subscriptions and advertisements, we have 
introduced other simulation parameters that for space reasons are not presented in
this paper. 
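
The role of the spreading coefficient can be illustrated numerically. The sketch below assumes that formula (1) reads R(n) = sc raised to the power n, which is our reconstruction of the garbled source; the function name is hypothetical.

```python
def reach_probability(sc, n):
    """Probability that an event must reach an agent whose dispatcher is n hops away."""
    if not 0.0 <= sc <= 1.0:
        raise ValueError("the spreading coefficient must lie in [0, 1]")
    if n < 0:
        raise ValueError("distance must be non-negative")
    return sc ** n
```

With sc = 0.5 the probability halves at every hop (1, 0.5, 0.25, 0.125 for distances 0 to 3), so traffic stays mostly local; with sc close to 1 it stays near 1 at every distance and events spread across the whole hierarchy.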

The environment we have selected to perform simulations is called OPNET [6]. In 
OPNET a model of the system to be simulated is defined by selecting components 
from proper libraries and by defining the way these components are connected 
together. OPNET provides a number of predefined libraries that model network 
components such as hubs, routers, and TCP/IP stacks. Using these libraries it is 
possible to easily define simulation models of local networks as well as WANs and 
wireless systems with mobile objects. In addition, OPNET allows users to define their 
own libraries to model the behavior of specific application-dependent elements. We 
have exploited this feature to define our simulation model and we have relied on the
existing libraries for modeling the underlying network infrastructure.

Before starting a simulation, the simulation model is customized by assigning 
values to the following parameters: 

• The characteristics of the underlying physical network in terms of connector
bandwidth, latency, protocols, and topology.

• The topology of the system, i.e., the structure of the event dispatching hierarchy and
the location of the dispatching servers on the physical nodes.

• The spreading coefficient and the parameters associated to the covering relations. 

• The number and location of connected agents and their types. 

• For each agent, the mean values of the distributions defining their life cycle 
(number of subscriptions, notification frequency, ...). 



4 Simulation Results 

The goal of simulation has been to understand when the usage of advertisements is
advantageous in terms of the performance of the most critical parts of the system (root
DS and network channels) and what happens when the bandwidth of network 
connections between dispatching servers decreases. 

In order to set up the simulation scenarios, we have first identified the operational 
conditions of our system. For instance, we have defined the maximum number of 
events manageable by a dispatching server, we have analyzed the relations between 
the number of physically different nodes of computation and the load on the LANs, 
we have observed the bandwidth consumed by notifications to define the proper 
channel sizes to interconnect the various LANs, etc. 

During simulation we have assumed that dispatchers are organized in a quaternary 
balanced tree. We have considered trees having 5, 21, or 85 dispatching servers 
organized in trees of 2, 3, or 4 levels respectively. Each dispatching server manages 
52 agents located on 4 hosts connected to the same LAN. The number of agents 
connected to a dispatching server has been determined on the basis of some 
preliminary simulations in order to avoid saturation of the root DS. In all simulation 
scenarios the 52 agents connected to a dispatching server are categorized as follows: 
16 proactive agents, 24 reactive agents, 8 sources, and 4 sinks. Proactive agents and 
sources send events every 8 seconds on average. The duration of the simulation has been
selected so that the initial transient can be disregarded. The LANs connecting a
dispatching server to its agents are 10 Mbit/sec Ethernet networks. These are
connected together through 1 Mb/sec or 64 kb/sec communication channels,
depending on the scenario being considered. We have always adopted the TCP
protocol.
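
The hierarchy sizes and the agent mix quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: all function and variable names are hypothetical, and the aggregate event rate assumes that every proactive agent and source emits independently at the stated mean of one event every 8 seconds.

```python
# Cross-check of the scenario figures quoted in the text.
# All names are illustrative; they do not come from the simulation model.

def servers_in_tree(levels: int, fanout: int = 4) -> int:
    """Number of nodes in a balanced tree with the given fan-out and number of levels."""
    return sum(fanout ** level for level in range(levels))

AGENT_MIX = {"proactive": 16, "reactive": 24, "source": 8, "sink": 4}
AGENTS_PER_SERVER = sum(AGENT_MIX.values())
MEAN_SEND_INTERVAL_S = 8.0  # proactive agents and sources, on average

def mean_event_rate(levels: int) -> float:
    """Mean events per second generated in the whole hierarchy.

    Assumption: senders are independent and each emits one event
    every MEAN_SEND_INTERVAL_S seconds on average.
    """
    senders = AGENT_MIX["proactive"] + AGENT_MIX["source"]
    return servers_in_tree(levels) * senders / MEAN_SEND_INTERVAL_S

if __name__ == "__main__":
    for levels in (2, 3, 4):
        n = servers_in_tree(levels)
        print(levels, n, n * AGENTS_PER_SERVER, mean_event_rate(levels))
```

Under these assumptions, the 3-level scenario has 21 dispatching servers and 1092 agents, and generates 63 events per second on average.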

Fig. 2 and Fig. 3 compare the performance of the root DS when event propagation 
is exclusively based on subscriptions and when subscriptions and advertisements are 
used together to define the event routing tables. We call these two cases 
subscriptions-based and advertisements-based, respectively. The figures show the 
average percentage of time spent by the root dispatching server (i.e., the potential 
bottleneck of the system) in managing notifications (Fig. 2) and subscriptions (Fig. 3) 
plotted against the spreading coefficient introduced in Section 3. Intervals are drawn
with a confidence level of 90%. The results refer to a dispatching hierarchy of
21 dispatching servers connected through 1 Mb/sec links.

Fig. 2 shows that in the advertisements-based case the root DS spends much less
time handling notifications than in the subscriptions-based case. In fact, in the
first case, the root DS has to handle only the events that actually need to be routed
to subtrees other than those of their originators, while in the latter case it handles all the
events that are generated in the system. In the subscriptions-based case, growth in
the spreading coefficient causes more events to be transmitted to other subtrees, thus
resulting in more work for the root dispatching server. Because of this
growth in workload, the root dispatching server becomes saturated for
spreading coefficients higher than 0.4. In this case, therefore, the advantage of the
advertisements-based approach over the subscriptions-based one is significant.
This result is also confirmed by the graph in Fig. 3, which shows that the
advertisements-based approach does not result in an appreciable growth in the processing
time the root DS spends handling subscription propagation. We have also analyzed the
processing time of advertisements, and we have noticed that this value is quite low
(between 2.65% and 2.9%) and does not seem to be influenced by the spreading
coefficient.



[Plot for Fig. 2. Axis label: time spent processing notifications / total simulation time. Series: without advertisement, with advertisement.]




Fig. 2. Time spent by the root dispatching server in processing notifications 

Similar simulations have been performed with dispatching hierarchies of 5 and 85
dispatchers. In the case of 85 dispatchers the subscriptions-based approach is not
usable because, even for sc = 0.1, the root DS becomes saturated. The case of 5 dispatchers
shows that for a small system the two approaches produce very similar results, with a
difference of at most 5% in favor of the advertisements-based case.

The advantages of the advertisements-based algorithm increase when the
bandwidth of the communication channels dedicated to the traffic of the event-based
infrastructure decreases. This case applies when the event-based middleware is
deployed across the boundaries of a single organization. Fig. 4 shows the results of a
set of simulations performed on a model with 21 dispatchers connected to each
other through 64 kb/sec links. The graphs show the utilization of the communication
links connecting dispatching servers at different levels in the hierarchy. The graphs
on the left-hand side refer to the traffic directed toward a dispatching server and its
controlled AOs (they reside on the same LAN), while those on the right-hand side
refer to the traffic in the opposite direction. When the subscriptions-based
approach is used, the 64 kb/sec channels entering the root DS become saturated
for values of sc higher than 0.5. Conversely, the traffic leaving the root DS is
lower than in the advertisements-based approach. In the latter case, in fact,
subscriptions are sent downward to establish the routing between senders and
subscribers.

Fig. 3. Time spent by the root dispatching server in processing subscriptions

In conclusion, the advertisements-based approach generally offers better
performance in large systems where communication tends to be localized
among groups of neighboring components. The advantages of the approach over
the subscriptions-based approach are particularly evident when low-speed
communication channels are used. The subscriptions-based algorithm remains
attractive for its simplicity in those cases where the dispatching infrastructure is not
working under heavy load.



5 Related Work 

While a number of researchers and practitioners are focusing on the development of
event-based middleware, we know of few efforts devoted to understanding the performance
and scalability of such middleware. A discussion of some preliminary requirements
for scalable event-based middleware is presented in [7]. In particular, the authors point
out the importance of having mechanisms for limiting network traffic among
distributed dispatching servers located on a wide-area network. In this respect, the
advertisement algorithm we have developed for JEDI addresses, at least partially,
some of these requirements. In the context of commercial systems, we are aware of
two middleware products, Smartsockets [8] and TIB/Rendezvous [9], that provide a
distributed implementation of the event dispatcher. In both cases, however, distribution
seems to be exploited to achieve reliability in a small system rather than scalability with
respect to the massive distribution of components and the efficiency of event propagation.

The problem of providing mechanisms for efficient multicast of events is tackled in
[1]. In this paper the authors present a new algorithm to verify the compatibility of
notifications with subscriptions. Unlike us, they assume that
subscriptions are known in advance at every node of the dispatching server network,
and do not focus on how they are actually propagated. Based on this information, they
can establish optimized paths for event notifications. The performance of the
proposed algorithm is evaluated through simulation.

Fig. 4. Utilization of links between LANs at various levels in the hierarchy

Our work builds on previous work by Carzaniga et al. ([2] and [3]), where
an advertisement approach is presented and simulation is used for the first time in the
context of the evaluation of event-based middleware. Unlike them, we have
focused our effort on hierarchical systems. Moreover, our algorithm prevents
advertisements and subscriptions from flooding the network of DSs. In the simulation model
we have introduced two additional types of agents (proactive and reactive) and we
have defined a model of locality based on the spreading coefficient. Also, we have
tuned the simulation parameters by analyzing existing applications and have relied on
existing and proven models of network devices and protocols provided by a
commercial simulator.



6 Conclusions

We have shown that the development of a scalable event-based middleware such as
JEDI requires special care in the definition of the mechanisms that allow dispatching
components to distribute events regardless of the physical location of their originators
and subscribers. We have exploited simulation to validate the
design alternatives we have defined. In particular, we have shown that the
advertisements-based approach works well when event traffic is localized, while the
subscriptions-based approach provides acceptable performance when events have to
be dispatched to components distributed all over the dispatching hierarchy, assuming
that the root dispatcher has been properly dimensioned. Based on the above
observations, we argue that the choice of the event propagation model depends on the
purpose and structure of the application that is going to exploit it. Therefore,
event-based infrastructures should not bind themselves to a specific approach. Instead,
they should be designed in such a way that the application designer is free to choose
the approach that best suits his/her needs.

We are currently consolidating the simulation model and validating it through a
proper analytical model. We aim at defining some load balancing mechanisms that
avoid or counteract saturation in dispatching servers. Finally, we are extending the
semantics of our event-based middleware by introducing new operations such as the
possibility of generating events that expect a reply from their receivers.



Abstract. Automatic parallelization is known to be an intractable problem in general.
This paper is about a new approach in which domain-specific knowledge is
used to facilitate automatic parallelization. The research focuses on three widely
used numerical methods: the finite difference method (FDM), the finite element
method (FEM), and the boundary element method (BEM). A prototype tool,
called ParAgent, has been developed to study the feasibility of the approach.
The current version of the prototype can parallelize Fortran-77 programs based
on the explicit time-marching FDM. The paper provides an overview of the new
approach and some results of its application, including the parallelization of the
NCAR/Penn State Mesoscale Meteorology Model MM5. The manual parallelization
of MM5 took about three years, whereas the parallelization using ParAgent
was done in about two weeks.



Introduction 

Scientists have invested enormous time and effort in developing and refining legacy
code for engineering and scientific applications. The solid pedigree associated with
these legacy codes makes them very valuable. Computation- and data-intensive legacy
codes stand to benefit from the application of parallel computing. For example, a 24-hour
simulation using the mesoscale climate model MM5 [3] runs for close to two hours on
a workstation. Long-term climate studies of global warming require 100-year
climate simulations that would entail 8-year-long runs! Clearly, parallel computing is
imperative for such large-scale simulations.
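
The 8-year figure follows directly from the quoted run time. As a rough check, assuming exactly two hours of serial run time per simulated day and ignoring leap days (the constant and function names are illustrative):

```python
# Back-of-the-envelope check of the serial run time quoted in the text.
HOURS_PER_SIMULATED_DAY = 2.0   # "close to two hours" per 24-hour simulation
DAYS_PER_YEAR = 365             # leap days ignored

def serial_run_years(simulated_years: float) -> float:
    """Wall-clock years needed to simulate the given number of years serially."""
    run_hours = simulated_years * DAYS_PER_YEAR * HOURS_PER_SIMULATED_DAY
    return run_hours / (24 * DAYS_PER_YEAR)

if __name__ == "__main__":
    print(round(serial_run_years(100), 1))  # about 8.3 years
```

In other words, the serial code runs only about twelve times faster than real time.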

The manual parallelization of legacy codes is time-consuming and prone to errors.
To address this problem, considerable research effort has gone into developing
automatic parallelization tools [6,12,16,20]. However, attempts to develop fully automatic
parallelization tools have not yet succeeded [5,8,9]. So far the research has
mainly focused on the parallelization of arbitrary sequential programs. However, the
parallelization problem is too difficult at this level of generality, and several sub-problems
are known to be NP-complete. Another shortcoming of existing tools is that they depend
mainly on syntactic analysis of programs and lack an effective way of dealing with
the complex semantics of legacy code. Multiple factors, including the physics of the
problem, the mathematical model, and the numerical technique, contribute to the complex
semantics.

We have developed a domain-specific interactive automatic parallelization tool
called ParAgent [19,22]. The current system can process three-dimensional
time-marching explicit finite difference codes written in Fortran-77 and
produce parallel programs for distributed memory computers. This source-to-source
translation tool is based on a new user-assisted approach in which
domain-specific knowledge is used to facilitate automatic parallelization. Our approach to
automatic parallelization is conceptually similar to the knowledge-based approach to
program understanding adopted in software engineering [15,18,25].

Parallelization of the MM5 [3] mesoscale climate model code has been previously 
attempted using several different tools including the commercial tool FORGE [12]. 
As far as we know, none of the attempts has been successful. A team of four re- 
searchers at the Argonne National Laboratory worked for three years to produce a 
manual parallelization of MM5. Using ParAgent, a postdoctoral student parallelized 
MM5 in about two weeks. 

This paper provides an overview of ParAgent and the results obtained by using it. 
The paper is organized as follows: the second section describes our approach to auto- 
matic parallelization; the third section describes ParAgent and its capabilities; the 
results and conclusions are presented in the fourth and fifth sections respectively. 
Details about the domain-specific approach and the architecture of ParAgent can be 
found in [22].



The Domain-Specific Approach 

The main principle behind our approach is to design a structured parallelization proc- 
ess that incorporates knowledge about the underlying numerical method to facilitate 
automation. Note that a few numerical methods such as the finite-difference method 
(FDM), the finite-element method (FEM), and the boundary-element method (BEM) 
form the basis for a large majority of the legacy code for engineering and scientific 
applications. 

Another important aspect of our approach is to blend automation and user
assistance to provide a pragmatic solution. Parallelization involves many decisions and a
lot of detailed work. Tedious and time-consuming tasks are the targets of automation,
and high-level decisions that can benefit from expert advice are left to the user. For
example, complex data dependency analysis is automated, but the choice of
parallelization strategy is left to the user. It is like a professor-student model in which the
professor directs the parallelization and the student works out the details and implements
the strategy suggested by the professor.

The analysis needed to arrive at a parallelizing scheme automatically is prohibi- 
tively complex and not practical. In our approach, the user suggests a high-level par- 
allelization scheme, and ParAgent performs the data-dependency analysis to identify 
the communication patterns. This type of analysis is tedious, time-consuming and 
likely to cause errors if done manually; ParAgent provides critical help by automating
it. For example, the user specifies that MM5 be parallelized along the I and J
dimensions, representing the X and Y axes in the horizontal plane. ParAgent analyzes all
variables that depend on the I and J indexes to identify the communication patterns and
their placement. This requires inter-procedural dataflow analysis [14] involving
hundreds of variables and complex control structures spanning subroutines.

Based on ten years of experience with parallelization of large scientific codes, we
have designed an interactive and structured parallelization process that has three
phases: diagnostics, communication analysis, and code generation. The first phase
requires user assistance and the other two phases are completely automatic.



Diagnostic Phase 

The parallelization process using ParAgent follows a strategy commonly used by ex- 
pert programmers. A programmer first acquires knowledge about the numerical 
method and the specifics of how the method is applied in a given code. At this stage, 
consultation with the domain expert is often beneficial to resolve ambiguities and 
confirm the understanding about a given program. The diagnostic phase of ParAgent 
essentially mimics such consultation. During this phase, ParAgent helps the user by 
providing concise information about the code. 

To illustrate the diagnostic phase, consider the MM5 climate model. For this model,
the user first specifies that it is based on the FDM and then processes each source file
of the serial program through ParAgent. The objective is to check whether the program
adheres to the finite differencing scheme. For example, ParAgent detected a problem
with an array variable UJ1(I). A domain expert pointed out that UJ1 actually
represents a slice of a two-dimensional array U(I,J) with J set to 1. This sort of variable
aliasing makes it hard to parallelize MM5. ParAgent identifies all uses of the variable
UJ1 and helps the user modify the code to replace UJ1(I) with U(I,1).
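
The kind of rewrite described here can be approximated with a purely textual pass. The sketch below is not how ParAgent works internally (the paper does not say); it is a minimal illustration in Python, with hypothetical function names, that assumes subscripts contain no nested parentheses and that declarations of UJ1 are handled separately by the user.

```python
import re

# Matches references such as UJ1(I) or uj1( K+1 ); Fortran is case-insensitive.
# Assumption: the subscript expression contains no nested parentheses.
ALIAS_REF = re.compile(r"\bUJ1\s*\(\s*([^()]+?)\s*\)", re.IGNORECASE)

def replace_slice_alias(line: str) -> str:
    """Rewrite each UJ1(<expr>) reference on a source line as U(<expr>,1)."""
    return ALIAS_REF.sub(lambda m: f"U({m.group(1)},1)", line)

def find_alias_uses(source: str) -> list[tuple[int, str]]:
    """Return (line number, text) for every source line that references UJ1."""
    return [(number, text)
            for number, text in enumerate(source.splitlines(), start=1)
            if ALIAS_REF.search(text)]
```

Listing the uses first and applying the rewrite only after the user confirms each one mirrors the consultative style of the diagnostic phase.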



Communication Phase 

ParAgent uses a two-step process to identify communication. In the first step commu- 
nication patterns are identified based on the indexing patterns of array variables. 
During this step, the objective is to ensure that all communication points and all vari- 
ables to be communicated at each point are identified. The resulting number of com- 
munication points identified is often very large. The second step is for optimization of 
communication. This step analyzes communication to eliminate redundancy, performs 
grouping of messages, and finds optimal placements to reduce the number of commu- 
nication points. 
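
The first step, deriving communication needs from array indexing patterns, can be illustrated with a small sketch. This is an assumption-laden toy, not ParAgent's algorithm: it recognizes only two-dimensional references whose subscripts have the form I, I+n, or I-n (and likewise for J), assumes both I and J are decomposed across processors, and uses hypothetical names throughout.

```python
import re
from collections import defaultdict

# Matches two-dimensional references such as U(I+1,J) or V(I,J-2).
# Only subscripts of the form <index>, <index>+<n> or <index>-<n> are recognized.
REF = re.compile(
    r"\b([A-Z]\w*)\(\s*I\s*([+-]\s*\d+)?\s*,\s*J\s*([+-]\s*\d+)?\s*\)"
)

def offset(text: str) -> int:
    """Convert an optional '+1' or '- 2' suffix to an integer offset."""
    return int(text.replace(" ", "")) if text else 0

def halo_requirements(statements):
    """Return {array: set of non-zero (di, dj) offsets read on the right-hand side}.

    A non-zero offset means the statement reads a neighbouring grid point,
    which may live on another processor and so must be communicated.
    """
    needs = defaultdict(set)
    for stmt in statements:
        _, _, rhs = stmt.partition("=")   # only right-hand-side reads matter
        for name, di, dj in REF.findall(rhs):
            off = (offset(di), offset(dj))
            if off != (0, 0):
                needs[name].add(off)
    return dict(needs)
```

For a five-point stencil the result is the familiar set of four neighbour offsets, which is also what a stencil display of the exchange pattern would show.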



Code Generation Phase 

The last phase of automatic parallelization primarily deals with two issues: global-to- 
local index transformation and insertion of communication primitives. Note that the 
hard task of identifying the communication pattern and synch/exchange points is done 
prior to this phase. This stage automates a lot of repetitive work such as changing the 
loop control statements to reflect the transformation to the local indices. 
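
A global-to-local index transformation of the kind mentioned here can be sketched as follows. The block decomposition and all names are assumptions made for illustration; the paper does not specify which decomposition ParAgent generates.

```python
def block_range(n_global: int, n_procs: int, rank: int) -> tuple[int, int]:
    """Inclusive 1-based global index range owned by `rank` in a block decomposition."""
    base, extra = divmod(n_global, n_procs)
    start = rank * base + min(rank, extra) + 1
    size = base + (1 if rank < extra else 0)
    return start, start + size - 1

def local_loop_bounds(lo: int, hi: int, owned: tuple[int, int]) -> tuple[int, int]:
    """Clip the global loop range lo..hi to the owned block and shift to local indices.

    The returned range is empty when the first bound exceeds the second.
    """
    start, end = owned
    g_lo, g_hi = max(lo, start), min(hi, end)
    return g_lo - start + 1, g_hi - start + 1

def to_global(i_local: int, owned: tuple[int, int]) -> int:
    """Map a 1-based local index back to its global index."""
    return owned[0] + i_local - 1
```

For example, a serial loop over I = 2..9 on a 10-point grid split over 4 processors becomes a loop over local indices 2..3 on the first processor and 1..3 on the second.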



Advantages 

This new approach has distinct advantages over existing approaches to compiler
support for automatic parallelization [1,7,10,11,12,16,17,20,21]. It allows the user to
provide high-level information such as a parallelization strategy. Without such information,
the parallelization can become an intractable search. The existing approaches do not
include a diagnostic phase to analyze the code based on the characteristics of a
numerical method. The diagnostic phase is needed to identify the quirks in the code, as
illustrated by the UJ1 example above. Without the removal of such quirks, the existing
tools are unable to perform the analysis necessary to identify communication. Once
the code is diagnosed to ensure that it conforms to the characteristics of the
numerical method, it is possible to simplify the parallelization process. For example,
the communication analysis is based on a known template instead of an exhaustive
data dependency analysis.



ParAgent 

ParAgent is a parallelization tool built using the domain-specific approach described 
in the previous section. The current prototype addresses explicit time marching FDM. 
It takes serial Fortran-77 codes as input and produces parallel programs for 
distributed memory computers. 

On launching the application, the user is presented with an inquiry screen. Here the
user provides information about the class and the location of the code, the binding
information (i.e., the loop indices that correspond to the grid dimensions), and the
desired parallel mapping directions. Next, the user is presented with the diagnostics
screen. The user can submit selected files for diagnosis. The system analyzes the
sequential program for inconsistencies and ambiguities based on the specific
characteristics of the numerical method. Any inconsistency or ambiguity found is brought
to the user's attention, with help information about the possible reasons for the error.
The user at this point can make changes to the code to fix the problem. When the code
passes the diagnostics phase, the system marks it as diagnosed.

When all the files have been successfully diagnosed, the user can move to the next
screen, the parallelization screen. The user can select the underlying communication
library (for example, RSL or MPI) and can select files for parallelization. Parallel code
with embedded communication library calls is generated.



Capabilities 

This section describes some of the important capabilities of ParAgent. These are 
grouped into diagnostic capabilities, visualization capabilities, and parallelization 
capabilities. 

Diagnostic Capabilities 

ParAgent automatically finds the correspondence (if any) between an array dimension 
and the grid dimension for each array variable after loop indices have been bound to 
grid dimensions. In some cases, user interaction is required to resolve array-space 
map conflicts such as for ambiguous uses of loop indices. This information is gener- 
ated only once for each variable. 

ParAgent allows files to be diagnosed one at a time. During this step, ParAgent
verifies that the serial code conforms to the characteristics of the underlying
numerical method. The system can identify conflicts such as data exchange patterns not
consistent with the differencing scheme. Auxiliary information and specific references to
the code are provided to help the user resolve these problems. Each file needs to
be diagnosed only once.

Visualization Capabilities 

ParAgent traces the sequence of subroutine calls and displays the call-graph. This 
helps the user to understand the structure of the code. After parallelization, the call- 
graph contains annotations to indicate subroutines that will have communication. 

After analyzing communication requirements, ParAgent displays the block struc- 
ture of the code. The user can click on a block to view the serial code that corresponds 
to the block. Each block usually represents many lines of code that can be embedded 
in a parallel program as the code to be executed sequentially at each individual proc- 
essor. The block structure also identifies the communication exchange points in the 
code. 

For each variable, ParAgent displays the data exchange patterns at each of the 
communication points. The user can view these patterns in the form of stencils 
showing the communication that will occur in parallel processing. The stencil display 
shows the underlying differencing scheme at work. 

Parallelization and Optimization Capabilities 

ParAgent provides two alternatives for parallelizing a subroutine. In one alternative, 
the effects of subroutine calls from within the procedure are ignored. In the other al- 
ternative, inter-procedural analysis is performed to analyze the effects of all the sub- 
routine calls. Subroutine calls may introduce additional communication and the selec- 
tive process allows the user to view the effect of the different subroutine calls. 

The diagnostic phase may take a couple of weeks for a large code consisting of 
several hundred files. However, the parallelization phase is automatic and takes a few 
minutes. During this phase, the actual parallel code is generated. 

ParAgent performs inter-procedural analysis to determine global communication
requirements and to perform communication optimization. Communication overhead
is a function of the number of messages and the size of the messages. ParAgent
reduces the total number of messages by a) coalescing all messages going to the same
processor at a given exchange point, b) optimizing the number of data exchange points,
and c) moving data exchange out of loops and subroutine calls whenever possible.
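
Item a), message coalescing, amounts to grouping pending messages by exchange point and destination. The following is a minimal sketch under the assumption that each message carries a distinct variable; the `Message` record and its field names are hypothetical and are not taken from ParAgent.

```python
from collections import defaultdict
from typing import NamedTuple

class Message(NamedTuple):
    exchange_point: str   # label of the point in the code where the exchange happens
    dest: int             # destination processor rank
    variable: str         # array whose boundary values are sent
    n_values: int         # number of values in the message

def coalesce(messages):
    """Merge all messages that share an exchange point and a destination.

    Returns {(exchange_point, dest): (sorted variable names, total values)},
    i.e. one combined message per key instead of one message per variable.
    """
    groups = defaultdict(list)
    for msg in messages:
        groups[(msg.exchange_point, msg.dest)].append(msg)
    return {
        key: (sorted({m.variable for m in msgs}), sum(m.n_values for m in msgs))
        for key, msgs in groups.items()
    }
```

The total volume of data is unchanged; only the number of messages, and with it the per-message startup cost, goes down.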



Results 

ParAgent has been used to parallelize the following three climate and environmental
modeling codes: MM5, RADM, and RAMS. ParAgent runs on UNIX workstations or PCs
running Linux. The input is Fortran-77 code and the output is an SPMD code
using the MPI library for communication.

ParAgent has been demonstrated at several sites: the EPA/NOAA Atmospheric
Sciences Modeling Division in North Carolina, a UNESCO-funded workshop on the
Project to Intercompare Regional Climate Simulations, the Center for Development of
Advanced Computing in India (CDAC), the National Center for Atmospheric
Research (NCAR), and the Pacific Northwest National Laboratory (PNNL).

In this section, we present results of parallelization of two legacy codes. 



Table 1. Performance of parallelized FDM benchmark (in seconds)

An FDM Benchmark

The first code is an FDM application that was obtained from Dayton University. This
application is useful for assessing the effect of problem size on the performance of the
parallel FDM program. This FDM application consists of 1443 lines of code distributed over
9 files. A team of two graduate students spent about 20 hours each to manually
parallelize this code. Using ParAgent, a different graduate student was able to parallelize
this code in 2 hours.

The parallelized code was run on the Alice Cluster [2] in Ames Laboratory at ISU, 
which has 64 PCs connected by a fast Ethernet switch. Each PC in the cluster has a 
200 MHz Pentium processor with 256KB cache and 256MB of main memory. The 
results of the parallelized code were compared to the serial version and verified to be 
accurate. The performance timings for the parallelized FDM benchmark are shown in 
Table 1. 



MM5 

The second code is the NCAR/Penn State Mesoscale Model MM5 [3], a
widely used mesoscale meteorology model. It is a complex Fortran-77 code with
more than 200 files and several hundred variables, with many levels of nested subroutines
and complex control flow, including loops spanning several hundred lines of
code. The parallelized code was run on a 64-node IBM SP1 machine. Each node had
64 MB of memory.

The results of the parallelized code were compared to the serial version and
verified to be accurate. The performance results for a 24-hour simulation of the
parallelized MM5 code are shown in Table 2.



Table 2. Performance of parallelized MM5 (in seconds)

Processors    Grid size 32x32x23    Grid size 64x64x23
1             2760                  18000
4             750                   4520
16            225                   1350
64            120                   420
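
The timings in Table 2 can be turned into speedup and parallel efficiency figures. The short sketch below only re-expresses the table's numbers; the helper names are illustrative.

```python
# Run times in seconds from Table 2, keyed by grid size and processor count.
TIMES = {
    "32x32x23": {1: 2760, 4: 750, 16: 225, 64: 120},
    "64x64x23": {1: 18000, 4: 4520, 16: 1350, 64: 420},
}

def speedup(grid: str, procs: int) -> float:
    """Single-processor time divided by the time on `procs` processors."""
    return TIMES[grid][1] / TIMES[grid][procs]

def efficiency(grid: str, procs: int) -> float:
    """Speedup per processor (1.0 would be ideal scaling)."""
    return speedup(grid, procs) / procs

if __name__ == "__main__":
    for grid, times in TIMES.items():
        for procs in times:
            print(grid, procs, round(speedup(grid, procs), 2),
                  round(efficiency(grid, procs), 2))
```

On 64 processors the speedup is 23 for the smaller grid and about 42.9 for the larger one, corresponding to efficiencies of roughly 36% and 67%, so the larger problem makes better use of the machine.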



Conclusion 

The prototype tool, ParAgent, validates the feasibility of our new approach. ParAgent
has been used to successfully parallelize several large codes for climate and
environmental modeling. Parallel code obtained from ParAgent has been verified to be
accurate as well as efficient. Use of ParAgent can save a considerable amount of time and
effort in the parallelization of large and complex legacy codes.



Abstract. The Java programming language and the underlying virtual machine
model have introduced new complexities for compilation. Various approaches 
ranging from just in time (JIT) compilation to ahead of time (AOT) compilation 
are being explored with the aim of improving the performance of Java programs. 
The hurdles facing the achievement of high performance in Java and the strengths 
and weaknesses of different approaches to Java compilation are addressed in this 
paper, specifically within the context of SGI’s effort to provide a high-performance 
Java execution environment for its computing platforms. The SGI JIT compiler 
and prototype AOT compiler are described, and performance results are presented 
and discussed. 



1 Introduction 

The Java language was designed to allow for fast and convenient development of very 
reliable and portable object-oriented software. Java applications are generally distributed 
in a platform-independent bytecode format and are tightly integrated with a number of 
libraries that foster software reuse and shorten development time. A Java execution 
environment thus typically consists of an interpreter or compiler accompanied by a large 
runtime environment. However, a number of these characteristics introduce additional 
complexities in the compilation process: 

- Portable executables. Executables are not distributed natively compiled. Instead, 
a portable bytecode format is used, requiring either interpretation or a native com- 
pilation cycle before execution. 

- Strict error checking at run time. A wide array of checks for exceptional conditions 
and erroneous code (e.g., out-of-bounds array accesses) are performed. 

- Lazy resolution of field and method accesses. Java solves what is known as the
fragile base class problem by laying out objects and method tables at link time
and resolving references to them lazily, as opposed to at compile time as in C++.

- Polymorphism and object-oriented coding styles. Methods are virtual by default,
leading to heavy use of expensive dynamic method dispatch. Typical object-oriented
coding styles encourage the use of numerous small methods, making method
invocation a performance bottleneck.

Consequently, the designers and implementors of commercial Java execution
environments face design problems and choices that are rarely found in other languages. Multiple
choices for executing Java bytecode need to be considered, including interpretation, just
in time (JIT) and ahead of time (AOT) compilation, and an implementation may in fact
combine several of these techniques.

The rest of this paper is organized as follows. Section 2 details some choices available
for Java compilation and discusses their suitability for different application scenarios.
Section 3 presents JIT and AOT compiler implementations for the SGI platform, and
Sections 4 and 5 compare and discuss their compile-time and run-time performance,
respectively. Section 6 discusses related work and Section 7 presents conclusions.

2 Choices for Java Compilation 

Interpretation was the first choice available for Java implementations and helped accel- 
erate early Java adoption due to its rapid retargetability. Its poor performance, however, 
quickly led to the proliferation of just in time (JIT) compilers, which translate bytecode 
to native code at run time, executing and caching it. A major benefit of JIT technology is 
that it integrates seamlessly into the Java execution environment and remains transparent 
to the application programmer and user. However, as the time required for compilation is 
added to the overall execution time of a program, time-consuming optimizations must be 
used sparingly or replaced by less costly algorithms. Code quality remains poor in com- 
parison to traditional compilers due to this design requirement of fast compilation. Most 
modern systems with JIT compilers can be described as mixed-mode, in that they com- 
bine an interpreter with a JIT compiler: the interpreter runs initially and collects profiling 
information, and performance-critical methods are identified and compiled as execution 
progresses. This in turn allows the creation of JIT compilers that implement more costly 
optimizations, since they are invoked more selectively. Extending this approach further, 
multiple JIT compilers at different levels of sophistication can be employed within a 
single virtual machine [4].
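
The mixed-mode idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name `MixedModeVM`, the invocation-count threshold, and the trace format are hypothetical and do not describe any particular virtual machine.

```python
from collections import Counter

COMPILE_THRESHOLD = 3  # hypothetical invocation count that triggers compilation


class MixedModeVM:
    """Toy model: methods start interpreted and are compiled once they become hot."""

    def __init__(self, threshold=COMPILE_THRESHOLD):
        self.threshold = threshold
        self.invocations = Counter()  # profiling data gathered while interpreting
        self.compiled = set()         # names of methods that have native code
        self.trace = []               # execution mode used for each call

    def invoke(self, method_name):
        if method_name in self.compiled:
            self.trace.append((method_name, "compiled"))
            return
        # Interpreted path: run the method and update its profile.
        self.invocations[method_name] += 1
        self.trace.append((method_name, "interpreted"))
        if self.invocations[method_name] >= self.threshold:
            self.compiled.add(method_name)  # stands in for invoking the JIT compiler


vm = MixedModeVM()
for _ in range(5):
    vm.invoke("hot")
vm.invoke("cold")
```

In this model the method called five times is interpreted three times and then runs as compiled code, while the method called once is never compiled, so compilation effort goes only to methods that repay it.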

With the growing importance of Java for server-side applications and the continuing 
demand for higher performance than that provided by early JIT compilers, implemen- 
tors attempted to leverage existing mature compiler infrastructure for Java either by 
translating Java to C or by connecting a Java front end to a common optimization and 
compilation back end. The result of this approach is a system that compiles Java to native 
code ahead of time (AOT). Such compilers can produce completely static standalone ex- 
ecutables, or they can work within the context of a traditional virtual machine which also 
supports interpretation or JIT compilation of dynamically loaded bytecode. Distributing 
applications in an AOT-compiled form, as opposed to distributing bytecode, gives de- 
velopers more control over the environments in which their applications are deployed 
and more protection against decompilation and reverse engineering. However, AOT- 
compiled code obviously lacks the simple portability of bytecode, and AOT systems 
tend to have lower levels of conformance with the standard Java platform, particularly
with its dynamic aspects. AOT systems offer the hope of higher performance than that 
available with traditional virtual machines and JIT compilers, but have yet to make that 
promise a clear reality. 

3 Implementation of SGI JIT and AOT Compilation 

SGI’s Java execution environment for Irix/MIPS is currently shipping with a JIT com- 
piler. In addition to these released products, SGI has developed a prototype AOT com- 
piler which has been integrated with the virtual machine and JIT compiler to form a 
mixed-mode execution environment. 

3.1 Implementation of the SGI JIT Compiler 

The SGI JIT compiler is designed to work in conjunction with a virtual machine derived 
from Sun’s reference implementation, communicating with it using the JIT API defined 
by Sun. Designed for very fast compilation, it has a simple structure and constructs 
no intermediate representation or control flow graph; bytecode is simply translated to 
machine code in a table-driven manner. This translation generally happens individually 
for each bytecode instruction, the only exception being two-instruction compare/branch 
sequences, which are translated together as a pair in order to generate code of reasonable 
quality. Complex bytecodes, such as those involving object allocation or complicated
exception checks, are translated into calls to runtime routines. Methods are translated
lazily just prior to their first execution. 
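
A table-driven translator of this kind can be sketched as follows. The templates, the MIPS-like pseudo-instructions, and the `runtime_` naming are hypothetical illustrations, not the SGI compiler's actual tables.

```python
# Hypothetical one-to-one templates: bytecode mnemonic -> pseudo machine code.
TEMPLATES = {
    "iload_0": ["move t0, s0"],
    "iload_1": ["move t1, s1"],
    "iadd": ["addu t0, t0, t1"],
    "ireturn": ["move v0, t0", "jr ra"],
}

# Two-instruction compare/branch sequences are translated together as a pair.
FUSED = {
    ("lcmp", "ifeq"): ["beq t0, t1, {target}"],
    ("lcmp", "ifne"): ["bne t0, t1, {target}"],
}


def translate(bytecode):
    """Translate a list of (mnemonic, operand) pairs in one pass, with no IR."""
    out = []
    i = 0
    while i < len(bytecode):
        op, _ = bytecode[i]
        nxt = bytecode[i + 1] if i + 1 < len(bytecode) else None
        if nxt is not None and (op, nxt[0]) in FUSED:
            # Compare followed by branch: emit one fused sequence for both.
            out += [line.format(target=nxt[1]) for line in FUSED[(op, nxt[0])]]
            i += 2
        elif op in TEMPLATES:
            out += TEMPLATES[op]
            i += 1
        else:
            # Complex bytecode: emit a call to a runtime routine.
            out.append(f"jal runtime_{op}")
            i += 1
    return out
```

For example, `translate([("lcmp", None), ("ifeq", "L1")])` yields a single branch instruction, while an unknown opcode such as `new` becomes a runtime call.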

Each method begins with a prologue which sets up a Java stack frame as defined 
by the JVM, initializing it to the extent necessary for exception handling and garbage 
collection. Most of this prologue, as well as other common code sequences, is kept in
a common code area for shared use by all methods, reducing code size.

The compiler uses a very fast, primitive form of global register allocation: frequently 
used method arguments, local variables and constants are mapped one-to-one to global 
registers for the entire method. This algorithm uses no live range information and makes 
no effort to allocate operand stack items to registers. Everything not chosen for global 
allocation is allocated within basic blocks to temporary registers and saved to memory 
at block boundaries, before method calls, and whenever an exception may be thrown. 
Additionally, register spilling may occur within a basic block due to register pressure. 

Resolving a bytecode instruction’s symbolic reference to a field or method is spec- 
ified in detail in the Java virtual machine specification eh. Although resolving these 
references eventually has to produce field or method table offsets to enable correct exe- 
cution, that resolution cannot legally take place until the instruction is actually executed. 
For this reason the compiler defers generation of the final code for these references and 
instead emits a branch to runtime routines implementing the required resolution and 
offset calculation. Before control is returned to the JIT-compiled code, the branch is 
overwritten with the final code using the calculated offset. 
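
The deferred resolution scheme can be modeled in a few lines. This is a sketch only: the `CallSite` class, the slot-list object model, and the `FIELD_OFFSETS` layout are hypothetical stand-ins for emitted machine code and the class constant pool.

```python
class CallSite:
    """A field-access site whose code is patched after its first execution."""

    def __init__(self, field_name, resolver):
        self.field_name = field_name
        self.resolver = resolver
        self.code = self._resolve_stub  # initially a branch to the resolver
        self.resolutions = 0

    def _resolve_stub(self, obj):
        # Runtime routine: resolve the symbolic reference to a slot offset.
        offset = self.resolver(self.field_name)
        self.resolutions += 1
        # Overwrite the stub with the final code, which uses the offset directly.
        self.code = lambda o: o[offset]
        return self.code(obj)

    def execute(self, obj):
        return self.code(obj)


FIELD_OFFSETS = {"x": 0, "y": 1}  # hypothetical class layout
site = CallSite("y", FIELD_OFFSETS.__getitem__)
point = [10, 20]                  # object modeled as a list of slots
first = site.execute(point)       # resolves, patches, then loads the field
second = site.execute(point)      # runs the patched code only
```

Resolution happens exactly once, at first execution, and every later execution pays only for the direct field load.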

Runtime exception checks are translated into explicit tests in the emitted code, 
branching to runtime routines to unwind the stack and transfer control to the appro- 
priate exception handler. 

3.2 Implementation of the SGI AOT Compiler

The SGI AOT compiler prototype has a novel design: rather than providing its own run- 
time system and producing native executables, it produces dynamically linked libraries 
which operate in the context of our Java virtual machine. The AOT system consists of a 
Java front end, a traditional multi-language compiler back end, and runtime support. 

The front end, fej, reads in Java class files, converts the bytecodes to its own 
intermediate representation, constructs a control flow graph, and then performs high- 
level optimizations, including type inferencing and the associated removal of certain 
runtime checks. The methods are next translated into a representation suitable for the 
back end of the SGI MIPSPro compiler [15], a highly optimizing compiler originally
developed for C, C++, and Fortran with recent modifications for Java. The MIPSPro 
compiler performs all of the traditional scalar compiler optimizations as well as advanced 
memory hierarchy optimizations. As it includes far more sophisticated optimizations 
than the JIT compiler, its compilation time is substantially higher. The output of AOT 
compilation is a dynamically linked library which contains one function symbol for each 
Java method in the given class. 

Aside from the exceptions described below, the execution model for JIT-compiled 
and AOT-compiled code is very similar, allowing them to share a large amount of runtime 
support code. For each loaded class, the JIT compiler runtime support routines search 
for an AOT compiler-generated library corresponding to that class and then locate the
compiled code for each method. If no compiled code is found for a given class or method, 
the JIT compiler is invoked instead. JIT-compiled and AOT-compiled methods share the 
same calling convention, allowing them to coexist without performance penalty. 
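
The lookup-with-fallback logic can be sketched as below. The dictionary-based library model and the function names are hypothetical; for brevity this sketch also invokes the JIT stand-in eagerly, whereas the JIT compiler described earlier translates methods lazily.

```python
def jit_compile(class_name, method_name):
    """Stand-in for the JIT compiler; returns a tagged code object."""
    return ("jit", class_name, method_name)


def bind_methods(class_name, method_names, aot_libraries):
    """Use AOT-compiled code where a library provides it, else fall back to the JIT.

    aot_libraries maps a class name to {method name: compiled code}.
    """
    library = aot_libraries.get(class_name, {})  # no library: everything is JIT-compiled
    bound = {}
    for name in method_names:
        if name in library:
            bound[name] = library[name]
        else:
            bound[name] = jit_compile(class_name, name)
    return bound
```

Because both kinds of code share one calling convention, the caller of a bound method does not need to know which compiler produced it.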

Due to the nature of the SGI linker and operating system, code in a library is not 
rewritten at run time, so the AOT compiler does not use the same scheme as the JIT 
compiler for bytecode instructions which symbolically reference fields and methods. 
Instead, code is generated which causes a conditional trap to the operating system if 
a reference has not yet been resolved, and the trap handler performs the resolution. 
Following that, code is generated which simply accesses the class constant pool explicitly 
to load the offset necessary for the instruction. This results in larger and slower code 
than with the JIT compiler, where the rewritten code does not need to access the constant 
pool. Exception checking is also done by means of conditional traps. Management of 
exception handlers and stack unwinding is done using the standard DWARF m format 
for object file debugging information and the DWARF runtime support library. These 
mechanisms are far slower than the custom exception handling and stack unwinding in 
the JIT compiler runtime support. 



4 Compile-Time Performance for Java Compilation Methods 

This section and the following section present compile-time and run-time performance 
results, respectively, for the SGI JIT compiler, the SGI AOT compiler, and the third-party 
AOT system TowerJ iffTSl l on the SGI platform. TowerJ is a clean-room, multi-platform 
Java AOT compiler whose emphasis is on server-side Java performance. Its current 
claim to fame is achieving the highest reported performance on VolanoMark 2.1 ffE2, 
measured on IA-32 Linux. Some compile-time experiments for non-SGI platforms are
also included. 

The benchmarks used were SPEC JVM98 1.03 [16], the jBYTEmark 0.9 suite of
small Java kernels, Embedded CaffeineMark 3.0 Q, and VolanoMark 2.1. For our tests,
SPEC JVM98 was modified to eliminate use of AWT, which is not supported by the
TowerJ implementation. The benchmarking platforms included: an SGI Origin 2000
(300 MHz R10000) with SGI’s JDK 1.1.6, with and without the AOT compiler, SGI’s
Java2 v1.2.2, and TowerJ 3.3.b1; a Dell Pentium II (350 MHz) running Red Hat Linux
with IBM’s JDK 1.1.8 ®; and a Sun UltraSPARC (143 MHz) with Sun’s Java2 v1.2.2
with the HotSpot 1.0 server compiler [17].

4.1 Comparison of SGI JIT and AOT Compilation Time for SPEC JVM98 

JIT compilation of the entire SPEC JVM98 suite takes 2.4 seconds, which is a statistically 
insignificant addition to the SPEC JVM98 execution time of approximately 7 minutes. 
However, the SGI AOT compilation time is approximately 30 minutes and the TowerJ 
compilation time is in the same order of magnitude, both easily dwarfing the benchmark’s 
run time. Obviously the technology in both of these compilers would be unsuitable in 
the context of a JIT compiler, and this increase in compile time should make possible 
an appreciable improvement in run time. 

4.2 JIT Compile Time vs. Run Time 

The SGI JIT was designed for fast compilation, as described in section 3.1 and illustrated
by the results in section 4.1. However, it is interesting to examine other data points on
the design spectrum of compile time vs. run time.

JIT compilers do not in general provide ways of reporting compilation time. Never- 
theless, one way of measuring its effect is to run a small kernel repeatedly within one 
virtual machine invocation and to measure the run time of each iteration. Each iteration 
will include some combination of bytecode interpretation, JIT-compiled code execution, 
and JIT compilation, depending on the design of the JIT compiler and its interaction with 
the virtual machine. The run times of the iterations should be expected to decrease over 
time, as more time is spent in JIT-compiled code execution and less time is spent in 
bytecode interpretation or JIT compilation. 
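
The measurement procedure just described can be sketched as a small harness. The function names are hypothetical, and the harness is written in Python purely for illustration; the scaling step divides every iteration time by the final one so that the last iteration is 1.0.

```python
import time


def time_iterations(kernel, iterations):
    """Run the kernel repeatedly in one process and time each iteration."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - start)
    return times


def scale_to_final(times):
    """Scale iteration times so that the final iteration equals 1.0."""
    final = times[-1]
    return [t / final for t in times]
```

With scaled times such as `[3.0, 2.0, 1.5]` becoming `[2.0, 1.33…, 1.0]`, any excess over 1.0 in early iterations approximates the cost of interpretation and compilation relative to steady-state execution.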

Figure 1 shows run times over three iterations of the jBYTEmark kernels with the
SGI JIT compiler, and over four iterations for IBM’s JIT compiler and Sun’s HotSpot
1.0 server compiler. Since the SGI JIT compiler compiles every method before its first
execution, run times with it have stabilized by the second iteration, with all time being 
spent in JIT-compiled code execution. The extra time in the first iteration is spent in JIT 
compilation, ranging up to 12% of the final JIT-compiled code execution time for these 
small kernels. 

The IBM and HotSpot compilers are more sophisticated than the SGI JIT compiler,
with more of the conventional compiler analyses and optimizations. However, the IBM
JIT compiler does modify some of these traditional algorithms for shorter compilation
time. Both systems delay compiling methods until their execution time or frequency
has reached certain thresholds. In the results in figure 1 for both these compilers, the
effects of longer compile time, higher-quality generated code, and delayed compilation
can be seen in the higher first-iteration times and in the greater variance among later
iterations, as more compilations are performed. Despite the longer compilation times of
the IBM and HotSpot compilers, the quality of the code they produce and their selective
application make them suitable and effective for use at run time as well.

Fig. 1. Relative run times for multiple iterations of jBYTEmark kernels. Each line represents
timings for an individual kernel, with successive iterations along the horizontal axis. All times are
scaled such that the time of the final iteration for each kernel is 1.0. The graphs’ vertical axes use
three different scales, the last two being logarithmic.



5 Run-Time Performance for Java Compilation Methods 

5.1 Comparison of SGI JIT and AOT Run Time for SPEC JVM98

Table 1 gives run times for several SPEC JVM98 tests using the SGI JIT compiler, SGI
AOT compiler, and TowerJ compilers. These should not be considered official SPEC
JVM98 results since they were not obtained according to the official SPEC JVM98 run
rules. SGI AOT results were obtained using AOT-compiled SPEC JVM98 code at the
-O3 optimization level but with JIT-compiled standard Java libraries.

The most performance-critical methods in mpegaudio supply the AOT compiler’s
global optimizer with large basic blocks of array computation and floating-point code to
optimize; it does well, in particular removing many redundant memory operations. This
in general is the class of application for which the AOT compiler is superior. Exception-
laden code, as in jack, tends to be slower with the AOT compiler because of the less
efficient DWARF runtime support for exception handling and stack unwinding. Most of
the run time in raytrace and mtrt is spent in small accessor methods which simply
load and return a field from an object. In terms of instruction count, AOT-compiled
invocations of these small methods are 32% longer than JIT-compiled invocations, due to
longer field access and method invocation code, as described earlier, and longer method
prologue/epilogue code, since the AOT compiler’s generic back end is not optimally
adapted for the shared JIT/AOT calling convention.

Table 1. SPEC JVM98 JIT vs. AOT Run-Time Performance

                        Time (s)               Speedup over SGI JIT
             SGI JIT    SGI AOT    TowerJ      SGI AOT    TowerJ
compress        62.2       61.1      69.7         1.03       .90
jess            61.1       77.2      63.9          .79       .96
raytrace        64.0       98.2      82.4          .65       .78
db              79.5       89.3     416.6          .89       .19
mpegaudio       50.2       39.5      68.7         1.27       .73
mtrt            70.0      106.2      84.7          .66       .83
jack            54.8       75.4     101.4          .82       .54

On the whole, the AOT compiler is unable to achieve significantly higher perfor- 
mance than the JIT compiler. While it is able to use the advanced compiler back-end 
technology SGI had already developed for other languages, that technology is in fact 
not currently well-suited for optimizing Java code. It does not contain the analyses nec- 
essary for removal of null pointer or array subscript checks, and its optimizer does not 
operate well in the presence of the traps used for runtime checking and resolution of 
field and method accesses. Finally, the inability to rewrite the code for those accesses 
hurts performance of AOT-compiled code for the vast majority of applications. 

Without an understanding of the TowerJ implementation, it is difficult to explain its 
performance in detail. Profiling data indicates that at least for the worst cases, db and 
jack, its inefficiency may actually lie in the runtime support, e.g., synchronization, 
garbage collection, or threading support. The compiler itself is not the only important 
piece of the performance picture, particularly for languages like Java. 

5.2 Comparison of SGI JIT and AOT Run Time for Other Benchmarks 

Embedded CaffeineMark 3.0 0 is a recent version of one of the earliest Java bench- 
marks. It is actually a suite of micro-benchmarks, very small pieces of code intended to 
measure specific aspects of a system’s performance rather than to represent real appli- 
cations. 

Table 2 presents CaffeineMark results for the SGI JIT and AOT compilers. The
AOT compiler is far superior to the JIT compiler for these tests, as their artificial code
and tight loops are handled easily by the global optimizer and scheduler. While this
provides further evidence that the AOT compiler does certain things well, the striking
difference between these results and those for SPEC JVM98 illustrates why benchmarks
like CaffeineMark have fallen out of favor in recent years; they do not tend to be good
predictors of performance in real applications.

Table 2. Embedded CaffeineMark 3.0 JIT vs. AOT Run-Time Performance

                 SGI JIT    SGI AOT    Speedup with AOT
Sieve               3304       7693          2.328
Loop                8572      25572          2.983
Logic               4623      33232          7.188
String              7986       7525          0.942
Float               5230      14835          2.836
Method              3150       2733          0.867
Overall score       5081      11219          2.208
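
As a worked check of the CaffeineMark speedup column, the ratios can be recomputed from the two score columns. This sketch assumes that CaffeineMark scores are "higher is better", so the speedup is the AOT score divided by the JIT score; the dictionary and function names are hypothetical.

```python
# Scores taken from the Embedded CaffeineMark table: (SGI JIT, SGI AOT).
# Assumption: higher scores are better, so speedup = AOT score / JIT score.
SCORES = {
    "Sieve": (3304, 7693),
    "Loop": (8572, 25572),
    "Logic": (4623, 33232),
    "String": (7986, 7525),
    "Float": (5230, 14835),
    "Method": (3150, 2733),
}


def aot_speedup(test):
    """Speedup of the AOT compiler over the JIT compiler for one test."""
    jit, aot = SCORES[test]
    return aot / jit


# Tests on which the AOT-compiled code scores lower than JIT-compiled code.
slower_with_aot = sorted(t for t in SCORES if aot_speedup(t) < 1.0)
```

The recomputation makes visible that the AOT advantage is concentrated in the loop-heavy tests, while the String and Method tests score slightly lower with AOT compilation.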

VolanoMark 2.1 El is a pure Java server benchmark measuring the average number
of messages transferred per second by a server among multiple clients. Despite TowerJ’s 
well-documented VolanoMark performance on IA-32 Linux, it is able to achieve only 
592 messages per second on Irix, compared to 1701 messages per second with the 
SGI Java2 v1.2.2 and JIT compiler. This illustrates not only that an AOT system can
be outperformed by a virtual machine with a simple JIT compiler, but also that high 
performance is not necessarily a portable phenomenon. Developers of AOT systems who 
want to claim portability across multiple platforms must pay attention to performance 
tuning across those platforms. 

6 Related Work 

The research literature appears to contain no work like that presented in this paper, 
comparing the JIT and AOT approaches to Java compilation and presenting experimental 
results, particularly within the context of a single Java execution environment. Significant 
JIT compilers that have been well-documented in the research literature are those from 
IBM 0, Intel [0, and the CACAO | ITT| and LaTTe projects (1 71 ]. Those that have not 
include Sun’s HotSpot compiler o and Compaq’s Fast JVM 0 . Appeal’s JRockit fTCHl 
virtual machine implements JIT compilation at multiple levels of optimization. IBM 
Research is also developing a dynamic optimizing compiler written completely in Java 
called Jalapeno 0. 

In addition to TowerJ dsi, clean-room AOT systems that are currently available 
include NaturalBridge’s BulletTrain iTEfll and IBM’s HPCJ. All of these products omit 
support for some parts of the Java platform, such as JNI or the Reflection API. In partic- 
ular, TowerJ is the only one which supports true dynamic class loading, as its runtime 
support does contain a bytecode interpreter. IBM Research 0 has adapted IBM compil- 
ers for high-performance numerical computation in Java by targeting the removal of null 
pointer and array subscript checking, and by improving array performance through new 
library classes and the introduction of semantic expansion techniques into the compiler. 
Marmot El is a static, highly optimizing AOT compiler, written almost entirely in Java, 
developed at Microsoft Research. 

7 Conclusions 

The Java programming language and the underlying virtual machine model have intro- 
duced new complexities for compilation that are currently being addressed in a variety 
of ways. This paper has discussed the JIT and AOT approaches to Java compilation, pre- 
sented specific implementations and performance results within the context of a single 
Java execution environment, and discussed the trade-offs of compile time vs. run time 
in JIT compilers. 

While AOT compilation currently does achieve higher performance in certain do- 
mains and has the potential for even further improvement, it has yet to prove itself over 
JIT compilation technology in general. At present, there are still natural places in the
Java implementation landscape both for AOT compilers and for JIT compilers ranging 
from the simple to the sophisticated, and the best implementations may be those that 
combine multiple approaches intelligently. It is clear that the task of compiling Java for 
high performance poses a new set of challenges which deserve continuing attention. 

Abstract. In this paper, we present an analytical performance model of
the parallel left-right looking out-of-core LU factorization algorithm. We
show the accuracy of the performance prediction for a prototype
implementation in the ScaLAPACK library. We show that with a correct
distribution of the matrix and with an overlap of IO by computation,
we obtain performance similar to that of the in-core algorithm. To get
such performance, the size of the physical main memory only needs to
be proportional to the product of the matrix order (not the matrix size)
by the ratio of the IO bandwidth and the computation rate: there is no
need for a large main memory to factorize a huge matrix!



1 Introduction 

Many important computational applications involve solving problems with
very large data sets [7]. For example, astronomical simulation, crash test
simulation, global climate modelling, and many other scientific and engineering
problems can involve data sets that are too large to fit in main memory. Using
parallelism can reduce the computation time and increase the available memory
size, but for challenging applications the memory is always insufficient: for
instance, in a mesh decomposition of a mechanical problem, a scientist would
like to increase accuracy by increasing the mesh size. Such applications are
referred to as “parallel out-of-core” applications.

To increase the available memory size, a trivial solution is to use the virtual
memory mechanism present in modern operating systems. Unfortunately, in [2]
we showed that this solution is inefficient if standard paging policies are employed. To
get the best performance, the algorithm must generally be restructured with
explicit I/O calls. In this paper, we present a study of such a restructuring for
the matrix LU factorization problem.

* This work is supported by a grant of the “Pôle de Modélisation de la Région
Picardie”.

The LU factorization is the kernel of many applications. The importance
of optimizing this routine needs no justification, given the increasing demand
of applications dealing with large matrices. In this paper we present an analytical
performance model of the parallel left-right looking out-of-core algorithm which 
is used in ScaLAPACK |IJ . The aim of this performance prediction model is to 
derive optimization for the algorithm. 

In Sections 2 and 3 we describe the LU factorization and the ScaLAPACK
parallel version. In Section 4 we present the out-of-core LU factorization and the
analytical performance model in Section 5. In Section 6 we analyze the overhead
of the algorithm and show how to avoid it.

2 LU Factorization 

The LU factorization of a matrix $A = (a_{ij})_{1 \le i,j \le N}$ is the decomposition of $A$
as a product of two matrices $L = (l_{ij})_{1 \le i,j \le N}$ and $U = (u_{ij})_{1 \le i,j \le N}$, such that
$A = LU$ where $L$ is lower triangular (i.e. $l_{ij} = 0$ for $1 \le i < j \le N$) and $U$ is
upper triangular (i.e. $u_{ij} = 0$ for $1 \le j < i \le N$).

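A well-known method for parallelization of the LU factorization is based on the
blocked right-looking algorithm. This algorithm is based on a block decomposition
of matrices $A$, $L$ and $U$:

\[
\begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix}
=
\begin{pmatrix} L_{00} & 0 \\ L_{10} & L_{11} \end{pmatrix}
\begin{pmatrix} U_{00} & U_{01} \\ 0 & U_{11} \end{pmatrix}
\]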

This block decomposition gives the following equations: 

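\[
\begin{aligned}
A_{00} &= L_{00} U_{00} && (1) & \qquad A_{10} &= L_{10} U_{00} && (3) \\
A_{01} &= L_{00} U_{01} && (2) & \qquad A_{11} &= L_{10} U_{01} + L_{11} U_{11} && (4)
\end{aligned}
\]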

These equations lead to the following recursive algorithm: 

1. Compute the factorization $A_{00} = L_{00} U_{00}$ in equation (1) (possibly by another
method).

2. Compute $U_{01}$ (resp. $L_{10}$) from equation (2) (resp. (3)). This computation
can be done by triangular solve ($L_{00}$ and $U_{00}$ are triangular).

3. Compute $L_{11}$ and $U_{11}$ from equation (4):

(a) Compute the new matrix $A' = A_{11} - L_{10} U_{01}$.

(b) Recursively factorize $A' = L_{11} U_{11}$.

This algorithm is called right-looking because once the new matrix $A'$ is computed,
the left part ($L_{00}$ and $L_{10}$) of the matrix is not used in the recursive computation.
The same is true for the upper part ($U_{00}$ and $U_{01}$). Moreover, it is easy to show
that this computation can be done in place: only one array is necessary to
hold the initial matrix $A$ and the resulting matrices $L$ and $U$.
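
The block algorithm above can be sketched in plain Python, with the recursion unrolled into a loop over diagonal blocks. This is a minimal illustration without pivoting, assuming a matrix for which no pivoting is needed; the function names `lu_blocked` and `reconstruct` and the list-of-lists representation are choices made for this sketch only.

```python
def lu_blocked(a, k):
    """In-place blocked right-looking LU without pivoting.

    On return, a holds U on and above the diagonal and the strictly lower
    part of L below it (the unit diagonal of L is implied).
    """
    n = len(a)
    for s in range(0, n, k):
        e = min(s + k, n)
        # Step 1: factorize the diagonal block A00 = L00 U00 (unblocked).
        for j in range(s, e):
            for i in range(j + 1, e):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: U01 from A01 = L00 U01 (forward substitution, unit diagonal).
        for c in range(e, n):
            for i in range(s, e):
                for m in range(s, i):
                    a[i][c] -= a[i][m] * a[m][c]
        # Step 2 (resp.): L10 from A10 = L10 U00 (substitution along each row).
        for r in range(e, n):
            for j in range(s, e):
                for m in range(s, j):
                    a[r][j] -= a[r][m] * a[m][j]
                a[r][j] /= a[j][j]
        # Step 3a: A' = A11 - L10 U01; the next loop iteration factorizes A'.
        for r in range(e, n):
            for c in range(e, n):
                for m in range(s, e):
                    a[r][c] -= a[r][m] * a[m][c]
    return a


def reconstruct(f):
    """Multiply the packed factors back together (L has a unit diagonal)."""
    n = len(f)
    return [[sum((f[i][m] if m < i else 1.0) * f[m][j]
                 for m in range(min(i, j) + 1))
             for j in range(n)] for i in range(n)]
```

Calling `reconstruct(lu_blocked(copy_of_a, k))` should reproduce the original matrix up to rounding for any block size `k`, which is a convenient way to check that the in-place storage of $L$ and $U$ is consistent.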

For numerical stability, partial pivoting (generally row pivoting) is introduced
in the computation. The result of the factorization is then the matrices $L$ and $U$
plus the permutation matrix $P$ such that $PA = LU$.

In the right-looking algorithm with partial pivoting, the factorization of $A_{00}$ and
the computation of $L_{10}$ are merged in the first step. For the sake of presentation,
we present an algorithm with partial pivoting (in place) where row
interchanges are applied in two stages.

1a. Compute the factorization $P \begin{pmatrix} A_{00} \\ A_{10} \end{pmatrix} = \begin{pmatrix} L_{00} \\ L_{10} \end{pmatrix} U_{00}$, where $P$ is the permutation
matrix which represents partial pivoting: the left part of matrix $A$ (i.e. $\begin{pmatrix} A_{00} \\ A_{10} \end{pmatrix}$)
is factorized.

1b. Apply pivot $P$ to the right part of matrix $A$ (i.e. $\begin{pmatrix} A_{01} \\ A_{11} \end{pmatrix}$).

2. Compute $U_{01}$ from equation (2).

3a. Compute the new matrix $A' = A_{11} - L_{10} U_{01}$.

3b. Compute $L_{11}$, $U_{11}$ and $P'$ by a recursive call of the factorization $P' A' = L_{11} U_{11}$
($P'$ is the permutation matrix).

4. Apply pivot $P'$ to the lower left part of matrix $A$ (i.e. the $L_{10}$ computed in
the first step). Finally, return the composition of $P$ and $P'$.
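
The effect of partial pivoting can be illustrated with a small unblocked sketch. It is a simplification under stated assumptions: the function name is hypothetical, the matrix is a list of lists, and each row interchange is applied to the whole row at once rather than in the two stages (steps 1b and 4) used in the presentation above.

```python
def lu_partial_pivoting(a):
    """In-place right-looking LU with row pivoting.

    Returns perm such that row i of PA is row perm[i] of the original A.
    On return, a holds U on and above the diagonal and L (unit diagonal
    implied) below it.
    """
    n = len(a)
    perm = list(range(n))
    for j in range(n):
        # Pivot: the entry of largest magnitude in column j, on or below the diagonal.
        p = max(range(j, n), key=lambda r: abs(a[r][j]))
        if a[p][j] == 0.0:
            raise ValueError("matrix is singular")
        if p != j:
            # Swapping whole rows pivots both the already computed part of L
            # (left of column j) and the not yet factorized right part.
            a[j], a[p] = a[p], a[j]
            perm[j], perm[p] = perm[p], perm[j]
        for i in range(j + 1, n):
            a[i][j] /= a[j][j]
            for c in range(j + 1, n):
                a[i][c] -= a[i][j] * a[j][c]
    return perm
```

For the matrix `[[0, 2, 1], [1, 1, 1], [2, 1, 0]]`, which cannot be factorized without pivoting because its first diagonal entry is zero, the returned permutation reorders the rows so that the product of the computed factors equals the permuted matrix.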



Fig. 1. A recursive call to the right-looking algorithm. Horizontal lines represent
pivoting. Dashed lines represent parts of rows which are not yet pivoted.



Figure 1 shows the different steps of the second recursive call of the right-
looking factorization.



3 Parallelization 

In ScaLAPACK, the parallelization of the previous algorithm is based on a
data-parallel approach: the matrix is distributed over the processors and the computation
is distributed according to the owner-computes rule.

The matrix is decomposed into k x k blocks. As noted above, at each recursive
application of the right-looking algorithm, the left and upper parts of the matrix
are factorized (modulo a permutation in the lower left part of the matrix). So, for
load balancing, a cyclic distribution of the data is used.

The matrix is distributed block-cyclically on a (virtual) grid of p rows and q
columns of processors. The block decomposition of the algorithm (shown in Figure
1) corresponds to the block distribution of the matrix. So step 1a of the algorithm
is computed by one column of p processors; step 2 is computed by one row of
q processors; step 3a is computed by the whole grid. Pivoting step 1b (resp.
4) is executed concurrently with computation step 1a (resp. 3b).
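
The block-cyclic mapping can be written down directly. In this sketch the function names and the zero-based block and processor indices are assumptions made for illustration; block (i, j) is assigned to the processor at grid position (i mod p, j mod q).

```python
def owner(block_row, block_col, p, q):
    """Grid coordinates of the processor owning a block under a 2D block-cyclic layout."""
    return (block_row % p, block_col % q)


def blocks_owned(proc_row, proc_col, p, q, n_blocks):
    """All blocks of an n_blocks x n_blocks block matrix held by one processor."""
    return [(i, j)
            for i in range(proc_row, n_blocks, p)
            for j in range(proc_col, n_blocks, q)]
```

Because the blocks of each processor are spread over the whole matrix, every processor keeps some blocks in the trailing submatrix as the factorization proceeds, which is the load-balancing argument made above.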

Now let us describe more precisely the different steps of the algorithm. Step
1a is implemented by the ScaLAPACK function pdgetf2, which factorizes a block of
columns. For each diagonal element of the upper block (i.e. $A_{00}$) the following
operations are applied:

1. determine pivot by a reduce communication primitive and exchange the pivot 
row with the current row; 

2. broadcast the pivot row to columns of processors; 

3. scale, i.e. divide, the column under pivot by the pivot value and update the 
matrix elements on the right of the column. 

Step 2 of the algorithm is implemented by the ScaLAPACK function pdtrsm:
the upper-left block (i.e. $A_{00}$) is broadcast to the processor row, followed by a
(BLAS) triangular solve.

Step 3a is implemented by the ScaLAPACK function pdgemm: the blocks
corresponding to $U_{01}$ are broadcast on columns (of processors); the blocks
corresponding to $L_{10}$ are broadcast on rows (of processors); then the blocks are
multiplied to update $A'$.

The performance of the parallel algorithm depends on the size of the blocks 
and the grid topology. The size of the block determines the degree and granu- 
larity of parallelism and also the performance of the BLAS-3 routines used by 
ScaLAPACK. The topology of the grid determines the cost of communications. 
In pj, it is shown that best performances are obtained with a grid with few rows: 
step 1a of the algorithm is fine grained and involves small communications (so
a lot of communication latencies) for pivoting and for the $L_{10}$ computation.



4 Parallel Out-of-Core LU Factorization 

Now, we consider the situation where matrix A is too large to fit in main memory. 
We present the parallel out-of-core left-right looking LU 
factorization algorithm used by the ScaLAPACK rou- 
tine pfdgetrf for parallel out-of-core LU factorization 
0. Similar algorithms are also described in jEpj. In the 
algorithm, the matrix is divided into blocks of columns
called superblocks. The width of the superblock is
determined by the amount of available physical memory.

As in the previous parallel algorithm, the matrix is
logically distributed block-cyclically on the p x q grid of
processors. But only the blocks of the current superblock
are in main memory; the others are on disk.

The parallel out-of-core algorithm is an extension of
the parallel in-core algorithm. It factorizes the matrix
from left to right, superblock by superblock. Each time a new superblock of
the matrix is fetched into memory (called the active superblock), all previous
pivoting and updates in the history of the right-looking algorithm are applied to the
active superblock. To do this update, superblocks lying to the left of the active
superblock are read again. Once the update is finished, the right-looking
algorithm resumes on the updated superblock, and the factorized active superblock
is written to disk. Once the last superblock is factorized, the matrix is read
again to apply the remaining row pivoting of the recursive phases (step 4).

Fig. 2. Superblocks.

The update of each active superblock is summarized in Figure 2. When a left
superblock is considered (called the current superblock), the update consists in
applying row pivoting to the active superblock and:

1’. read the under-diagonal part of the current superblock;

2’. compute the $U_{01}$ part of the active superblock by a triangular solve (function
pdtrsm);

3’. update $A_{11}$, i.e. subtract the product of $L_{10}$ of the current superblock by
the $U_{01}$ part of the active superblock (function pdgemm).
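
The IO pattern of this left-right looking scheme can be estimated with a short counting sketch. It is a rough model under stated assumptions: the function name is hypothetical, the superblock width is assumed to divide the matrix order, the under-diagonal part of a left superblock is approximated by all of its rows from the diagonal block downwards, and the final pivoting pass over the matrix is not counted.

```python
def io_volume(n, width):
    """Words read and written for an n x n matrix with column superblocks of `width` columns.

    Assumptions: width divides n; the final row-pivoting pass is ignored;
    the under-diagonal part of a left superblock starting at column c is
    counted as (n - c) * width words.
    """
    if n % width != 0:
        raise ValueError("width must divide n")
    superblocks = n // width
    read = 0
    written = 0
    for active in range(superblocks):
        read += n * width  # fetch the active superblock
        for current in range(active):
            # Re-read the under-diagonal part of each superblock on the left.
            read += (n - current * width) * width
        written += n * width  # store the factorized active superblock
    return read, written
```

The count shows the trade-off behind the superblock width: every word is written once, but narrower superblocks mean more active superblocks and therefore more re-reading of the superblocks on their left.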



5 Performance Prediction

In this section we present an architectural model and an execution time prediction
of the parallel out-of-core left-right looking algorithm.



[Figure 3: cost formulas (5)-(16) for the functions pdgetf2, pdtrsm and pdgemm]

Fig. 3. Costs of the different steps of the left-right looking algorithm for out-of-core
LU factorization.

5.1 Architectural Model

The architectural model is a distributed memory machine with an interconnection
network and one disk on each node, like a cluster. Each node stores its
blocks on its own disk. Let us characterize this kind of architecture by some
constants representing the computation time, the communication time and the IO
time.

Computation time. It is usually based on the time required for the computation
of one floating point operation on one processor and is represented
by a constant $\alpha$. In fact, this time is not constant and depends on the processor
memory hierarchy and on the kind of computation. For instance, a matrix
multiply algorithm exhibits good cache reuse whereas the product of a vector by
a scalar has poor temporal locality. So we distinguish three times for the floating
point operations which appear in the algorithm: $\alpha_g$ for matrix multiply, $\alpha_t$
for triangular solve, and $\alpha_s$ for scaling of vectors.

Communication time. As usual, the communication time is represented by
the $\beta + V\tau$ model, where $\beta$ is the startup time, $\tau$ is the time to transmit
one unit of data and $V$ is the volume of data to be communicated. We
consider only broadcast communication in our model. The constants $\beta$ and
$\tau$ depend on the topology of the virtual grid: $\beta_p^q$ is the startup time
for a column of $p$ processors to broadcast data on their rows, and $1/\tau_p^q$
represents the throughput. Similarly, $\beta_q^p$ and $\tau_q^p$ denote the times for one row of
$q$ processors to broadcast data on their columns. These functions depend on the
communication network. For instance, for a cluster of workstations with a
switch, the broadcast can be implemented by a tree diffusion. Then $\beta_p^q =
\log_2 q \times \beta$ and $\tau_p^q = \log_2 q \times \frac{\tau}{p}$, where $\beta$ is the startup communication time for
one node and $1/\tau$ the throughput of the medium. With a hub (i.e. a bus),
the model is: $\beta_p^q = p(q - 1) \times \beta$ and $\tau_p^q = \tau$ if $q > 1$, $\tau_p^q = 0$ if $q = 1$.

IO time. The IO time is based on the throughput of a disk. Let $\tau^{io}$ be the time
to read or write one word for one disk; then $\tau_p^{io} = \frac{\tau^{io}}{p}$ is the time to read or
write $p$ words in parallel for $p$ independent disks.
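The architectural constants above can be collected in a few helper functions. The following Python lines are only a sketch under stated assumptions: the function names and the `network` argument are hypothetical, and the per-word term $\tau/p$ used for the switch case is our reading of the tree-diffusion formula.

```python
import math


def broadcast_constants(p, q, beta, tau, network="switch"):
    """Return (startup, per-word time) for a column of p processors
    broadcasting data along its rows of q processors."""
    if q == 1:
        # A row of one processor has nobody to broadcast to.
        return 0.0, 0.0
    if network == "switch":
        # Tree diffusion: log2(q) stages (tau / p is an assumption).
        stages = math.log2(q)
        return stages * beta, stages * tau / p
    if network == "hub":
        # Shared bus: startups are serialized, the medium is shared.
        return p * (q - 1) * beta, tau
    raise ValueError("unknown network: " + network)


def broadcast_time(volume, p, q, beta, tau, network="switch"):
    """Time of one broadcast of `volume` words in the beta + V*tau model."""
    startup, per_word = broadcast_constants(p, q, beta, tau, network)
    return startup + volume * per_word


def io_time(volume, disks, tau_io):
    """Time to read or write `volume` words spread over independent disks."""
    return volume * tau_io / disks
```

With a single column of processors (`q = 1`) both constants vanish, which is the situation exploited later when the communication overhead is discussed.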



5.2 Modelling 

To model the algorithm, we estimate the time used by each function. For each
function, we distinguish computation time from communication time, and we
distinguish the intrinsic cost of the parallel right-looking algorithm from the
cost introduced by the out-of-core extension.

Let $N$ be the matrix order and $K$ the column width of a superblock; the block
size is $k \times k$. The grid of processors is composed of $p$ rows of $q$ columns. We have
the following constraints for the different constants: $N$ is a multiple of $K$, and $K$
is a multiple of $k$ and $q$. Let $L = N/k$ be the block width of the matrix, $S = N/K$ the
number of superblocks, and $B = K/k$ the block width of a superblock.

Data are equi-distributed on processors. 
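As a small illustration of these definitions, the sketch below derives the block counts from $N$, $K$ and $k$ and checks the divisibility constraints. The function name `block_layout` and the returned dictionary keys are hypothetical; the per-processor word count simply assumes the equal distribution stated above.

```python
def block_layout(N, K, k, p, q):
    """Derive L, S and B for an N x N matrix with superblock width K,
    block size k x k and a p x q processor grid."""
    if N % K != 0:
        raise ValueError("N must be a multiple of K")
    if K % k != 0 or K % q != 0:
        raise ValueError("K must be a multiple of k and q")
    return {
        "L": N // k,  # block width of the matrix
        "S": N // K,  # number of superblocks
        "B": K // k,  # block width of a superblock
        "words_per_processor": N * N / (p * q),
    }
```

For instance, `block_layout(1200, 300, 50, 2, 2)` describes a matrix of 24 block columns grouped into 4 superblocks of 6 block columns each.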



Figure 3 collects the costs of the different steps of the algorithm. For the
sake of simplicity, we do not consider the pivoting cost in our analysis. This cost
is mainly the cost of the reduce operations for each element of the diagonal and the
cost of row interchanges, plus the cost of re-reading and re-writing the matrix. This time
can easily be integrated into the analysis if necessary.



pdgetf2 cost. Step 1 of the algorithm (ScaLAPACK function pdgetf2) is applied
to the block columns (of width k) under the diagonal. There are L such blocks.
This computation is independent of the superblock size. For the computation
cost, we distinguish the computation of the blocks on the diagonal (5) from the
computation of the blocks under the diagonal (6). The total computation time of the
pdgetf2 function for the whole N x N matrix is (7).

For communications in pdgetf2, for each block on the diagonal and for each
element on the diagonal, the right part is broadcast to the processor column (8).



pdtrsm cost. Step 2 of the algorithm (computation of U01) is applied to each
block row lying to the right of the diagonal: there is a triangular solve for each of
them. The computation cost of a triangular solve between two blocks of size k x k
is $\alpha_t k^3$. The total computation cost for every pdtrsm applied by the algorithm
is (9).

The communication cost for pdtrsm is the broadcast of the diagonal blocks to
the processor row. One broadcast is done during the factorization of the active
superblock (10), and another one is needed during the future updates (11).



pdgemm cost. Step 3 of the algorithm updates the trailing sub-matrix. The
computation is mainly matrix multiply plus broadcast. For a trailing sub-matrix
of order H, there are $(H/k)^2$ block multiplications of size k x k. The cost of such a
multiplication is $2 \alpha_g k^3$. The total computation cost is (12).
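The closed forms of the two computation totals can be checked numerically. The sketch below is only an illustration: it assumes that the triangular solves and the block multiplications are summed over $i = 1, \dots, L-1$ (block rows to the right of the diagonal, and trailing sub-matrices of order $ik$), and it leaves out the common factors $\alpha_t/pq$ and $\alpha_g/pq$. The function names are hypothetical.

```python
from fractions import Fraction


def pdtrsm_flops_sum(N, k):
    """Sum of k^3 * i over i = 1 .. L-1, with L = N / k."""
    L = N // k
    return sum(k ** 3 * i for i in range(1, L))


def pdtrsm_flops_closed(N, k):
    """Closed form (N^2 k - N k^2) / 2."""
    return Fraction(N * N * k - N * k * k, 2)


def pdgemm_flops_sum(N, k):
    """Sum of 2 k^3 * i^2 over i = 1 .. L-1, with L = N / k."""
    L = N // k
    return sum(2 * k ** 3 * i * i for i in range(1, L))


def pdgemm_flops_closed(N, k):
    """Closed form (2 N^3 + N k^2) / 3 - N^2 k."""
    return Fraction(2 * N ** 3 + N * k * k, 3) - N * N * k
```

Exact rational arithmetic is used so that the comparison between the sum and the closed form does not depend on floating point rounding.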

For the communication cost, we distinguish the cost of the factorization of the active
superblock from the cost of the update of the active superblock. For the factorization
of the superblock, the cost is the broadcast of one row of blocks and the broadcast
of one column of blocks (13). The figure of the update phase illustrates the successive updates
for an active superblock. All the blocks of the columns under the diagonal in the left superblock
that has been read are broadcast (14). At the same time, the symmetric rows of blocks of the
current superblock are broadcast (15).



IO cost. The IO cost corresponds to the read/write of the active superblock
(16) and the read of the left superblocks (17).

5.3 Experimental Validation of the Analytical Model 

To validate our prediction model, we ran the ScaLAPACK out-of-core factorization
program on a cluster of 8 PC-Celeron nodes running Linux and interconnected by
a Fast-Ethernet switch. Each node has 96 Mb of physical memory. The model
described in the previous sections is instantiated with the following constants
(experimental measurements): $1/\alpha_g$ = 237 Mflops, $1/\alpha_t$ = 123 Mflops, $1/\alpha_s$ = 16
Mflops, $\beta$ = 1.7 ms, $1/\tau$ = 11 MB/s, $1/\tau^{io}$ = 1.8 MB/s.

Table 1 shows the comparison between the running time and the predicted
time (in italic) of the program. We measured the time for Input/Output and for
Computation, and we distinguished the Communication time during the factorization
of active superblocks from the Communication time during the update
of active superblocks. For computation and communication, the running time was
close to the predicted time. There were some differences for the IO times. This is mainly
due to our rough model of IO: IO performance is more difficult to model
because file access performance depends on the layout of the file on the disk
(fragmentation).

6 Out-of-Core Overhead Analysis 

In comparison with the standard in-core algorithm, the overhead of the
out-of-core algorithm is the extra IO cost and the broadcast (of columns) cost for the
update of the active superblock: for each active superblock, the left superblocks must
be read and broadcast once again.

This overhead cost is given by the equations of Fig. 3 that cover the update of the
active superblock, for communications and for IO. It is easy to show that if $K = N$ (i.e. $S = 1$) then this
cost is equal to zero: it is the in-core algorithm execution time.

The overhead cost is $O(N^3)$, and is not negligible. In the following, we
show how to reduce this overhead cost. Let $O_c$ be the overhead
communication cost and $O_{io}$ be the overhead IO cost.

6.1 Reducing Overhead Communication Cost 

As shown by the model and experimental results, the topology of the grid of 
processors has a great influence on the overhead communication cost: 

Fact 1. If the number of columns $q$ is equal to 1, then $O_c = 0$.

If there is only one column of processors, there is no broadcast of columns during
the update. If we consider a communication model where the broadcast cost
increases with the number of processors, then the greater the number of columns,
the greater $O_c$ is. Figure 4 shows the influence of the topology on the performances.
The same figure also plots the predicted performances of the in-core
right-looking algorithm. We used the constants of our small PC-Celeron cluster.
With a topology of one column of 16 processors (a ring) there is no extra
communication cost. The difference with the in-core performances is due to the
extra IO.

6.2 Overlapping IO and Computations 

Table 1. Comparison of experimental and theoretical (in italic) running times and performances of the left-right looking out-of-core LU
factorization. N is the matrix order, K the superblock width, p the number of rows of the processor grid, q the number of columns, S
is the number of superblocks. The size of the matrix in Gigabytes is given in the first column. Times are given in days (d), hours (h),
minutes (m) and seconds (s). The last column shows the real and predicted performances in Mflops.

Fig. 4. Theoretical performances of the LU factorization on a cluster of 16
PC-Celeron/Linux (64Mb) interconnected by a Fast-Ethernet switch: comparison of the
parallel in-core (IC) right-looking algorithm and the parallel out-of-core (OoC) left-right
looking algorithm, with 3 kinds of topology (1 x 16, 4 x 4, and 16 x 1). N is the matrix
order.

A simple way to avoid the IO overhead is to overlap the IO with the computation.
In the left-right looking algorithm, during the updates of the active superblock, the
left superblocks are read from left to right. An overlapping scheme is to read
the next left superblock during the update of the active superblock with the current
one: if the time for this update is greater than the time for reading the next
superblock, then the overhead IO cost is avoided.
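The overlapping condition can be illustrated with a few lines of Python. This is a minimal sketch, not the ScaLAPACK implementation: the function name and the two input lists are hypothetical, and the read of the first left superblock is deliberately left out because the scheme above only describes prefetching the next superblock.

```python
def exposed_read_time(update_times, read_times):
    """IO time that is NOT hidden when the read of left superblock j+1
    is issued during the update with left superblock j.

    update_times[j]: time of the update of the active superblock with
                     left superblock j.
    read_times[j]:   time to read left superblock j from disk.
    """
    if len(update_times) != len(read_times):
        raise ValueError("one update time and one read time per superblock")
    exposed = 0.0
    for j in range(len(update_times) - 1):
        # The prefetch of superblock j+1 runs concurrently with update j;
        # only the part of the read that outlasts the update is exposed.
        exposed += max(0.0, read_times[j + 1] - update_times[j])
    return exposed
```

When every update lasts at least as long as the following read, the function returns 0, which corresponds to the case where the overhead IO cost is avoided.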

Now, let us consider the resources needed to achieve such a total overlapping.
Let $M$ be the amount of memory devoted to a superblock in one processor. For
a matrix of order $N$, the width of a superblock is then $K = pqM/N$. Let $O_{io}^{o}$ be the
overhead IO cost that is not overlapped in this new scheme.

Theorem 1. If the number of columns of processors $q$ is equal to 1 and if $pM >
N \frac{\tau^{io}}{2\alpha_g}$ then $O_{io}^{o} = 0$.

Proof. Let us go back over the update part of the algorithm (see the figure of the update). For the
sake of simplicity, we underestimate the update computation time, and we
consider only the main cost of this update: the pdgemm part. For each current
superblock to the left of the active superblock, the update time is equal to the communication
of B blocks of rows in the active superblock (since q = 1, there is no communication of
blocks of columns for the current superblock) plus the computation time of the
update of the active superblock. The first part, i.e. the communication part, is
equal to:

B^2 k^2 \tau_q^p + B \beta_q^p    (18)

Let H be the height of the current left superblock in the update of the active
superblock (in number of k x k blocks). The computation time for the update
of the active superblock with the current one is:

\left( (H - B) \times B + \frac{B(B-1)}{2} \right) \times \left( 2 \frac{\alpha_g}{pq} k^3 B \right)    (19)

The reading time of the next left superblock is:

\left( (H - 2B) \times B + \frac{B(B-1)}{2} \right) \times \left( k^2 \tau_{pq}^{io} \right)    (20)

Now consider the situation where the IO is overlapped by computation (i.e.
$O_{io}^{o} = 0$), that is $\frac{(18) + (19)}{(20)} > 1$. Note that if $\frac{(19)}{(20)} > 1$ then $\frac{(18) + (19)}{(20)} > 1$. So we
restrict the problem to the following: determine for which superblock width B

\frac{(H - B) B + \frac{B(B-1)}{2}}{(H - 2B) B + \frac{B(B-1)}{2}} \times \frac{2 \frac{\alpha_g}{pq} k^3 B}{k^2 \tau_{pq}^{io}} > 1

The first part of this expression is always greater than 1. We determine for
which superblock width the second part of the expression is greater than 1. By
definition $K = k \times B$ ($K$ is the width of a superblock in number of columns), and
$\tau_{pq}^{io} = \frac{\tau^{io}}{pq}$. We have

\frac{2 \frac{\alpha_g}{pq} k^3 B}{k^2 \tau_{pq}^{io}} > 1 \iff \frac{2 \alpha_g}{\tau^{io}} K > 1 \iff K > \frac{\tau^{io}}{2 \alpha_g}

Since $K = pqM/N$ and $q = 1$, if $pM > N \frac{\tau^{io}}{2 \alpha_g}$ then $\frac{(19)}{(20)} > 1$, i.e. $O_{io}^{o} = 0$.



Note that in the algorithm, the widths of the active and current superblocks are
equal. An idea to reduce the need for physical memory is to specify different
widths for the active and the current superblock during the update: increase the
width of the active superblock (i.e. the computation time) and reduce the width of
the current superblock (i.e. the read time).



7 Conclusion 

In this paper we presented a performance prediction model of the parallel
out-of-core left-right looking LU factorization algorithm which can be found in
ScaLAPACK. This algorithm is mainly an extension of the parallel right-looking LU
factorization algorithm. Thanks to this modelling, we isolated the overhead
introduced by the out-of-core version. We observed that the best virtual topology
to avoid the communication overhead is one column of processors. We showed
that a straightforward scheme to overlap the IO with the computations allows us to
reduce the IO overhead of the algorithm. We determined the memory size which
is necessary to avoid the IO overhead. The memory size needed is proportional
to the square root of the matrix size.

To see if this result is practicable, consider a small cluster of PC-Celeron
with 16 nodes and with a Fast-Ethernet switch. To factorize an 80 Gigabyte
matrix (matrix order 100000) we need 26 Megabytes (MB) of memory per
superblock (active, current and prefetched) per node, i.e. 78 MB per node. The
predicted execution time to factorize the matrix is 4.5 days without overlapping,
and 2.5 days otherwise. If we replace the Intel Celeron processors of 237 Mflops
by Digital Alpha AXP processors of 757 Mflops, then the memory size needed
per processor is 252 MB. The predicted computation time is about 36 hours
without overlapping and about 21 hours otherwise (1.7 times faster). This last time
is the estimated time for the in-core algorithm with the same topology (i.e. one
column of 16 processors). With a better topology for the in-core algorithm (4
columns of 4 processors), the in-core algorithm takes 18 hours to factorize the
matrix, but the memory needed per node is 5 Gigabytes: 20 times greater than
the memory necessary for the out-of-core version.
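The memory figures of this example can be re-derived with a short calculation. The sketch below is hedged: it assumes 8-byte (double precision) matrix entries, 1 MB = 10^6 bytes, the measured disk throughput of 1.8 MB/s, and the bound of Theorem 1 read as $pM > N \tau^{io} / (2 \alpha_g)$ with $M$ counted in words. The function names are hypothetical.

```python
WORD_BYTES = 8  # assumption: double-precision matrix entries


def matrix_bytes(n):
    """Storage of an n x n matrix in bytes."""
    return n * n * WORD_BYTES


def superblock_bytes_per_node(n, p, flops_per_s, disk_bytes_per_s):
    """Smallest superblock memory per node (in bytes) that hides the IO,
    for a single column of p processors (q = 1)."""
    alpha_g = 1.0 / flops_per_s              # time of one matrix-multiply flop
    tau_io = WORD_BYTES / disk_bytes_per_s   # time to read one word
    words = n * tau_io / (2.0 * alpha_g * p)
    return words * WORD_BYTES


celeron = superblock_bytes_per_node(100000, 16, 237e6, 1.8e6)
alpha = superblock_bytes_per_node(100000, 16, 757e6, 1.8e6)
print("matrix size (GB):", matrix_bytes(100000) / 1e9)
print("Celeron, per superblock (MB):", round(celeron / 1e6, 1))
print("Alpha, three superblocks (MB):", round(3 * alpha / 1e6, 1))
print("in-core memory per node (GB):", matrix_bytes(100000) / 16 / 1e9)
```

Under these assumptions the calculation gives about 26 MB per superblock for the Celeron cluster and about 252 MB for three superblocks with the faster processors, in line with the values quoted above; the faster the processor, the more memory is needed to keep the update longer than the read.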

We plan to integrate this overlapping scheme into the ScaLAPACK parallel
out-of-core left-right looking function, and to validate the theoretical
improvement experimentally. We also plan to study a general overlapping scheme where both
IO and communication are overlapped by computation, based on an extension of
a previous work for the in-core case J2\ ■ 
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Abstract. This paper shows a new way of integrating task and data 
parallelism by means of a coordination language. Coordination and com- 
putational aspects are clearly separated. The former are established using 
the coordination language and the latter are coded using HPF (together 
with only a few extensions related to coordination). This way, we have a 
coordinator process that is in charge of both creating the different HPF 
tasks and establishing the communication and synchronization scheme 
among them. In the coordination part, processor and data layouts are 
also specified. The knowledge of data distribution belonging to the differ- 
ent HPF tasks at the coordination level is the key for an efficient imple- 
mentation of the communication among them. Besides that, our system 
implementation requires no change to the runtime system support of the 
HPF compiler used. We also present some experimental results that show 
the efficiency of the model. 



1 Introduction 

High Performance Fortran (HPF) [Tfl has emerged as a standard data parallel, 
high level programming language for parallel computing. However, a disadvan- 
tage of using a parallel language like HPF is that the user is constrained by 
the model of parallelism supported by the language. It is widely accepted that 
many important parallel applications cannot be efficiently implemented follow- 
ing a pure data-parallel paradigm: pipelines of data parallel tasks |2j, a common 
computation structure in image processing, signal processing or computer vision; 
multi-block codes containing irregularly structured regular meshes 0 ; multidis- 
ciplinary optimization problems like aircraft design J3J. For these applications, 
rather than having a single data-parallel program, it is more appropriate to sub- 
divide the whole computation into several data-parallel pieces, where these run 
concurrently and co-operate, thus exploiting task parallelism. 

Integration of task and data parallelism is currently an active area of research 
and several approaches have been proposed piigmi Integrating the two forms 
of parallelism cleanly and within a coherent programming model is difficult |8j . 

* This work was supported by the Spanish project CICYT TIC-99-0754-C03-03 
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In general, compiler-based approaches are limited in terms of the forms of task 
parallelism structures they can support, and runtime solutions require the
programmer to manage task parallelism at a lower level than data
parallelism. The use of coordination models and languages to integrate task and
data parallelism mm is proving to be a good alternative, providing a high 
level mechanism and supporting different forms of task parallelism structures in 
a clear and elegant way. Coordination languages m are a class of programming 
languages that offer a solution to the problem of managing the interaction among 
concurrent programs. The purpose of a coordination model and the associated 
language is to provide a means of integrating a number of possibly heterogeneous
components in such a way that the collective set forms a single application that 
can execute on and take advantage of parallel and distributed systems. 

BCL P2J P3j is a Border-based Coordination Language focused on the solu- 
tion of numerical problems, especially those with an irregular surface that can 
be decomposed into regular, block structured domains. It has been successfully 
used on the solution of domain decomposition-based problems and multi-block 
codes. Moreover, other kinds of problems with a communication pattern based 
on (sub) arrays interchange (2-D FFT, Convolution, solution of PDEs by means 
of the red-black ordering algorithm, etc.) may be defined and solved in an easy 
and clear way. 

In this paper we describe the way BCL can be used to integrate task and 
data parallelism in a clear, elegant and efficient way. Computational tasks are 
coded in HPF. Because the syntax of BCL has a Fortran 90 / HPF style,
both the coordination and the computational parts can be written
using the same language, i.e., the application programmer does not need to learn
different languages to describe different parts of the problem, in contrast with
other approaches fZ) . The coordinator process, besides of being in charge of creat- 
ing the different tasks and establishing their coordination protocol, also specifies 
processor and data layouts. The knowledge of data distribution belonging to the 
different HPF tasks at the coordination level is the key for an efficient implemen- 
tation of the communication and synchronization among them. In BCL, unlike 
in other proposals HP. the inter-task communication schedule is established 
at compilation time. Moreover, our approach requires no change to the runtime 
support of the HPF compiler used. The evaluation of an initial prototype has 
shown the efficiency of the model. We also present some experimental results. 

The rest of the paper is structured as follows. In Sect. 2, by means of some
examples, the use of BCL to integrate task and data parallelism is shown. In
Sect. 3 some preliminary results are mentioned. Finally, in Sect. 4 some
conclusions are sketched.

2 Integrating Task and Data Parallelism Using BCL 

Using BCL, the computational and coordination aspects are clearly separated, 
as the coordination paradigm proclaims. In our approach, an application consists 
of a coordinator process and several worker processes. The following code shows
the scheme of a coordinator process (first listing) and of a worker
process (second listing).

program program_name
   DOMAIN declarations
   CONVERGENCE declarations
   PROCESSORS declarations
   DISTRIBUTION information
   DOMAINS definitions
   BORDERS definitions
   Processes CREATION
end

subroutine subroutine_name (...)
   DOMAIN declarations ! dummy
   CONVERGENCE declarations ! dummy
   GRID declarations
   GRID distribution
   GRID initialization
   do while .not. converge
      PUT_BORDERS
      GET_BORDERS
      Local computation
      CONVERGENCE test
   enddo
end subroutine subroutine_name

The coordinator process is coded using BCL and is in charge of: 

— Defining the different blocks or domains that form the problem. Each one 
will be solved by a worker process, i.e., by an HPF task. 

— Specifying processor and data layouts. 

— Establishing the coordination scheme among worker processes: 

• Defining the borders among domains. 

• Establishing the way these borders will be updated. 

• Specifying the possible convergence criteria. 

— Creating the different worker processes. 

On the other hand, worker processes constitute the different HPF tasks that
will solve the problem. Local computations are carried out by means of HPF
statements, while the communication and synchronization among worker processes
are carried out through some incorporated BCL primitives.

The different primitives and the way BCL is used are shown in the next 
sections by means of two examples. The explanation is self contained, i.e., no 
previous knowledge of BCL is required. 



2.1 Example 1. Laplace’s Equation 

The following program shows the coordinator process for an irregular problem
that solves Laplace's equation in two dimensions using Jacobi's finite difference
method with 5 points.

\Delta u = 0 \quad \text{in } \Omega    (1)

where $u$ is a real function, $\Omega$ is the domain, a subset of $R^2$, and Dirichlet boundary
conditions have been specified on $\partial\Omega$, the boundary of $\Omega$:

u = g \quad \text{on } \partial\Omega    (2)

1) program example1
2) DOMAIN2D u, v
3) CONVERGENCE c OF 2
4) PROCESSORS p1(4,4), p2(2,2)
5) DISTRIBUTE u (BLOCK,BLOCK) ONTO p1
6) DISTRIBUTE v (BLOCK,BLOCK) ONTO p2
7) u = (/1,1,Nxu,Nyu/)
8) v = (/1,1,Nxv,Nyv/)
9) u (Nxu,Ny1,Nxu,Ny2) <- v (2,1,2,Nyv)
10) v (1,1,1,Nyv) <- u (Nxu-1,Ny1,Nxu-1,Ny2)
11) CREATE solve (u,c) ON p1
12) CREATE solve (v,c) ON p2
13) end

The domains into which the problem is divided are shown in Fig. 1 together with a
possible data distribution and the border between domains. Dotted lines represent
the distribution inside each HPF task. Line 2 in the coordinator process is used
to declare two variables of type DOMAIN2D, which represent the two-dimensional
domains. In general, the dimension ranges from 1 to 4. These variables take their
values in lines 7 and 8. These values represent Cartesian coordinates, i.e. the
domain assigned in line 7 is a rectangle that covers the region from point (1,1)
to (Nxu,Nyu). From the implementation point of view, a domain variable also
stores the information related to its borders and the information needed from
other domain(s) (e.g. data distribution).

The border is defined by means of the operator <-. As can be observed
in the program, the border definition in line 9 causes data from column
2 of domain v to refresh part of column Nxu of domain u. Symmetrically, the
border definition in line 10 causes data from column 1 of domain v to be
refreshed by part of column Nxu-1 of domain u.
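The effect of a border definition can be mimicked on plain Python lists. The sketch below is not BCL: `make_grid`, `region` and `apply_border` are hypothetical helpers, regions are written as (x1, y1, x2, y2) with 1-based inclusive coordinates as in the coordinator program, and the element-by-element pairing order of the two regions is an assumption.

```python
def make_grid(nx, ny, fill=0.0):
    """Grid indexed as grid[x][y] with 1-based coordinates (index 0 unused)."""
    return [[fill] * (ny + 1) for _ in range(nx + 1)]


def region(x1, y1, x2, y2):
    """All points of the rectangle from (x1, y1) to (x2, y2), inclusive."""
    return [(x, y) for x in range(x1, x2 + 1) for y in range(y1, y2 + 1)]


def apply_border(dst, dst_box, src, src_box):
    """Emulate `dst(dst_box) <- src(src_box)`: refresh dst from src."""
    dst_points = region(*dst_box)
    src_points = region(*src_box)
    if len(dst_points) != len(src_points):
        raise ValueError("region sizes at both sides of <- must be equal")
    for (dx, dy), (sx, sy) in zip(dst_points, src_points):
        dst[dx][dy] = src[sx][sy]


# Hypothetical sizes: u is 4 x 6, v is 3 x 3, and the border covers
# rows Ny1 = 2 to Ny2 = 4 of the last column of u.
u = make_grid(4, 6)
v = make_grid(3, 3)
for x in range(1, 4):
    for y in range(1, 4):
        v[x][y] = 10 * x + y
apply_border(u, (4, 2, 4, 4), v, (2, 1, 2, 3))   # like line 9
```

After the call, the part of column 4 of `u` between rows 2 and 4 holds the values of column 2 of `v`, while the rest of `u` is unchanged.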

Fig. 1. Communication between two HPF tasks

A border definition can be optionally labeled with a number that indicates
the connection type in order to distinguish kinds of borders (or to group them
using the same number). The language provides useful primitives in order to ease
(or even automatically establish) the definition of domains and their (possibly
overlapping) borders (e.g. intersection, shift, decompose, grow). The region
sizes at both sides of the operator <- must be equal (although not their shapes).
Optionally, a function can be used at the right hand side of the operator that 
can take as arguments different domains El- 

Line 4 declares subsets of HPF processors where the worker processes are 
executed. The data distribution into HPF processors is declared by means of 
instructions 5 and 6. The actual data distribution is done inside the different 
HPF tasks. The knowledge of the future data distribution at the coordination 
level allows a direct communication schedule, i.e., each HPF processor knows 
which part of its domain has to be sent to each processor of other tasks. 

A CONVERGENCE type variable is declared in line 3, which is passed as an 
argument to the worker processes spawned by the coordinator. The clause OF 2 
indicates the number of HPF tasks that will take part in the convergence criteria. 
The worker processes receive this variable as a dummy argument. However, when 
the type of the dummy argument is declared, the clause OF is not specified, 
as the worker processes do not need to know how many processes are solving 
the problem. This way, the reusability of the workers is improved (coordination 
aspects are specified in the coordinator process). 

Lines 11 and 12 spawn the worker processes in an asynchronous way so that 
both HPF tasks are executed in parallel. The code for worker processes is shown 
in the following program: 

1) subroutine solve (u,c)
2) DOMAIN2D u
3) CONVERGENCE c
4) double precision, GRID2D :: g, g_old
5) !hpf$ distribute (BLOCK,BLOCK) :: g, g_old
6) g%DOMAIN = u
7) g_old%DOMAIN = u
8) call initGrid (g)
9) do i=1, niters
10) g_old = g
11) PUT_BORDERS (g)
12) GET_BORDERS (g)
13) call computeLocal (g,g_old)
14) error = computeNorm (g,g_old)
15) CONVERGE (c,error,maxim)
16) print *, "Max norm: ", error
17) enddo
18) end subroutine solve

Lines 2 and 3 declare dummy arguments u and c, which are passed from
the coordinator. The GRID attribute appears in line 4. This attribute is used
to declare a record with two fields, the data array and an associated domain.
Therefore, the variable g contains a domain, g%DOMAIN, and an array of double
precision numbers, g%DATA, which will be dynamically created when a value is
assigned to the domain field in line 6. This is an extension of our language since
a dynamic array cannot be a field of a standard Fortran 90 record.

Note that line 5 is a special kind of distribution since it produces the distri- 
bution of the field DATA and the replication of the field DOMAIN. 

Statement 10 produces the assignment of two variables with GRID attribute.
Since g_old has its domain already defined, this instruction will just produce
a copy of the values of field g%DATA to g_old%DATA. In general, a variable with
GRID attribute can be assigned to another variable of the same type if they have
the same domain size or if the assigned variable has no DOMAIN defined yet. In
this case, before copying the data stored in the DATA field, a dynamic allocation
of the field DATA of the receiving variable is carried out.

Lines 11 and 12 are the first where communication takes place. The instruction
PUT_BORDERS(g) in line 11 causes the data from g%DATA needed by the
other task (see instructions 9 and 10 in the coordinator process) to be sent. This
is an asynchronous operation. In order to receive the data needed to update the
border associated with the domain belonging to g, the instruction GET_BORDERS(g)
is used in line 12. The worker process will suspend its execution until the data
needed to update its border are received.

In this example, there is only one border for each domain. In general, if several
borders are defined for a domain, PUT_BORDERS and GET_BORDERS will affect all
of them. However, both instructions may optionally have a second argument, an
integer number that represents the kind of border to be "sent"
or "received".

Local computation is accomplished by the subroutines called in lines 13 and 
14 while the convergence method is tested in line 15. The instruction CONVERGE 
causes a communication between the two tasks that share the variable c. In 
general, this instruction is used when an application needs a reduction of a 
scalar value. 
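The interplay of the three primitives can be imitated with Python threads. This is only an analogy under stated assumptions, not the BCL runtime: an unbounded queue stands for the asynchronous PUT_BORDERS, a blocking `get` for GET_BORDERS, and a hypothetical `Convergence` class reduces a scalar with a maximum, as the third argument of CONVERGE in the listing suggests. Each task holds a single number instead of a grid.

```python
import queue
import threading


class Convergence:
    """Max-reduction of one scalar per task, shared by n tasks."""

    def __init__(self, n):
        self._barrier = threading.Barrier(n)
        self._lock = threading.Lock()
        self._values = []
        self.result = None

    def converge(self, value):
        with self._lock:
            self._values.append(value)
        if self._barrier.wait() == 0:      # every task has contributed
            self.result = max(self._values)
            self._values = []
        self._barrier.wait()               # result is now visible to all
        return self.result


def worker(name, inbox, outbox, conv, start, iters, log):
    value, err = start, None
    for _ in range(iters):
        outbox.put(value)         # like PUT_BORDERS: does not block
        neighbour = inbox.get()   # like GET_BORDERS: blocks until data arrive
        new = (value + neighbour) / 2.0   # stand-in for the local computation
        err = conv.converge(abs(new - value))
        value = new
    log[name] = (value, err)


def run_two_tasks(start_u, start_v, iters):
    to_u, to_v = queue.Queue(), queue.Queue()
    conv, log = Convergence(2), {}
    tasks = [
        threading.Thread(target=worker,
                         args=("u", to_u, to_v, conv, start_u, iters, log)),
        threading.Thread(target=worker,
                         args=("v", to_v, to_u, conv, start_v, iters, log)),
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return log
```

Because the send never blocks, both tasks can post their border before either of them waits for the other, which avoids the deadlock that two blocking sends could cause.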

Fig. 2. Another irregular problem

In order to stress the way our approach achieves code reusability, Fig. 2
shows another irregular problem that is solved by the following program:

1) program example1_bis
2) DOMAIN2D u, v, w
3) CONVERGENCE c OF 3
4) PROCESSORS p1(4,4), p2(2,2), p3(2,2)
5) DISTRIBUTE u (BLOCK,BLOCK) ONTO p1
6) DISTRIBUTE v (BLOCK,BLOCK) ONTO p2
7) DISTRIBUTE w (BLOCK,BLOCK) ONTO p3
8) u = (/1,1,Ncu,Nru/)
9) v = (/1,1,Ncv,Nrv/)
10) w = (/1,1,Ncw,Nrw/)
11) u (Ncu,1,Ncu,Nrw) <- w (2,1,2,Nrw)
12) u (Ncu,Ndv,Ncu,Nru) <- v (2,1,2,Nrv)
13) v (1,1,1,Nrv) <- u (Ncu-1,Ndv,Ncu-1,Nru)
14) w (1,1,1,Nrw) <- u (Ncu-1,1,Ncu-1,Nrw)
15) CREATE solve (u,c) ON p1
16) CREATE solve (v,c) ON p2
17) CREATE solve (w,c) ON p3
18) end

The most relevant aspect of this example is that subroutine solve does not need
to be modified: it is the same as in the previous example. This is due to the
separation that has been made between the definition of the domains (and their
relations) and the computational part. Lines 15, 16 and 17 are instantiations of
the same process for different domains.



2.2 Example 2. 2-D Fast Fourier Transform 

2-D FFT transform is probably the application most widely used to demonstrate 
the usefulness of exploiting a mixture of both task and data parallelism BPS- 
Given an NxN array of complex values, a 2-D FFT entails performing N inde- 
pendent 1-D FFTs on the columns of the input array, followed by N independent 
1-D FFTs on its rows. 
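The decomposition into column and row transforms can be checked on a small array. The sketch below uses a naive O(N^2) discrete Fourier transform in place of a fast algorithm, which is a deliberate simplification; the function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive 1-D discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * m / n) for j in range(n))
            for m in range(n)]


def dft2_by_slices(a):
    """2-D transform of a square array: 1-D transforms on the columns,
    followed by 1-D transforms on the rows."""
    n = len(a)
    cols = [dft([a[r][c] for r in range(n)]) for c in range(n)]
    b = [[cols[c][r] for c in range(n)] for r in range(n)]
    return [dft(row) for row in b]


def dft2_direct(a):
    """2-D transform computed directly from its definition."""
    n = len(a)
    return [[sum(a[r][c] * cmath.exp(-2j * cmath.pi * (s * r + t * c) / n)
                 for r in range(n) for c in range(n))
             for t in range(n)]
            for s in range(n)]
```

The N column transforms are independent of each other, and so are the N row transforms, which is what makes each stage data parallel.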

program example2
DOMAIN2D a, b
PROCESSORS p1(Np), p2(Np)
DISTRIBUTE a (*,BLOCK) ONTO p1
DISTRIBUTE b (BLOCK,*) ONTO p2
a = (/1,1,N,N/)
b = (/1,1,N,N/)
a <- b
CREATE stage1 (a) ON p1
CREATE stage2 (b) ON p2
end

In order to increase the solution performance and scalability, a pipeline solution 
scheme is preferred as proved in pj and m- This mixed task and data parallelism 
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scheme can be easily codified using BCL. The code above shows the coordinator 
process, which simply declares the domain sizes and distributions, defines the 
border (in this case, the whole array) and creates both tasks. For this kind of
problem there is no convergence criterion.
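The two-stage pipeline can be pictured with two Python threads connected by a queue. This is a minimal sketch and not the BCL implementation: `run_pipeline`, `column_op` and `row_op` are hypothetical names, the images are assumed to be square lists of lists, and the `None` sentinel that ends the stream is a device of this sketch.

```python
import queue
import threading


def run_pipeline(images, column_op, row_op):
    """Stage 1 transforms the columns of each image and hands the array
    over; stage 2 transforms the rows while stage 1 works on the next image."""
    channel = queue.Queue()
    results = []

    def stage1():
        for img in images:
            n = len(img)
            cols = [column_op([img[r][c] for r in range(n)]) for c in range(n)]
            # Rebuild the array row by row before sending the whole array.
            channel.put([[cols[c][r] for c in range(n)] for r in range(n)])
        channel.put(None)  # end of the input stream

    def stage2():
        while True:
            arr = channel.get()  # blocks until stage 1 has sent an array
            if arr is None:
                break
            results.append([row_op(row) for row in arr])

    t1 = threading.Thread(target=stage1)
    t2 = threading.Thread(target=stage2)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results
```

Any pair of slice operations can be plugged in; for a 2-D transform both would be the same 1-D transform applied first to columns and then to rows.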

The worker processes are coded as follows. Stage 1 reads an input element,
performs the 1-D transformations and calls PUT_BORDERS(a). Stage
2 calls GET_BORDERS(b) to receive the array, performs the 1-D transformations
and writes the result. The communication schedule is known by both tasks, so
that a point-to-point communication between the different HPF processors can
be carried out.

subroutine stagel (d) 

D0MAIN2D d 
complex, GRID2D : : a 
!hpf$ distribute a(*, block) 
a°/„D0MAIN = d 
do i= 1, n_images 
! a new input stream element 
call read_stream (a7 0 DATA) 

!hpf$ independent 
do icol = 1, N 

call fftSlice(a7 0 DATA( : ,icol)) 
enddo 

PUT_B0RDERS (a) 
enddo 
end 



3 Preliminary Results 

In order to evaluate the performance of BCL, a prototype has been developed. 
Several examples have been used to test it and the obtained preliminary results 
have successfully proved the efficiency of the model m- Here, we show the 
results for the two problems explained above. 

A cluster of four DEC AlphaServer 4100 nodes interconnected by means of
Memory Channel has been used. Each node has four Alpha 21164 processors (300
MHz) sharing 256 MB of RAM. The operating system is Digital Unix
V4.0D (Rev. 878). The implementation is based on source-to-source
transformations together with the necessary libraries, and it has been built on
top of the MPI communication layer and the public domain HPF compilation system
ADAPTOR [?]. No change to the HPF compiler has been needed.

Table 1 compares the results obtained for Jacobi’s method in HPF and in
BCL considering 2, 4 and 8 domains, each with a 128 x 128 grid. The program
has been executed for 20000 iterations. BCL offers better performance than
HPF due to the advantage of integrating task and data parallelism. When the
number of processors is equal to the number of domains (only task parallelism



subroutine stage2(d)
  DOMAIN2D d
  complex, GRID2D :: b
!hpf$ distribute b(block,*)
  b%DOMAIN = d
  do i = 1, n_images
    GET_BORDERS(b)
!hpf$ independent
    do irow = 1, N
      call fftSlice(b%DATA(irow,:))
    enddo
    ! a new output stream element
    call write_stream(b%DATA)
  enddo
end

Table 1. Computational time (in seconds) and HPF/BCL ratio for Jacobi’s method

                                    HPF vs. BCL (ratio)
Domains  Sequential  4 Processors          8 Processors          16 Processors
2          97.05     42.40/41.27 (1.03)    35.05/27.66 (1.27)    33.73/22.67 (1.49)
4         188.88     93.90/90.06 (1.04)    70.75/45.06 (1.57)    69.61/29.28 (2.38)
8         412.48     185.62/199.66 (0.93)  150.54/95.85 (1.57)   163.67/56.43 (2.90)



is achieved), BCL has also shown better results. Only when there are more
domains than available processors (8 domains on 4 processors in Table 1, ratio
0.93) does BCL perform worse, because of the context-switch overhead among
processes.

Table 2 shows the execution time per input array for HPF and BCL
implementations of the 2-D FFT application. Results are given for different
problem sizes. Again, the performance of BCL is generally better. However, HPF
performance approaches that of BCL as the problem size becomes larger and the
number of processors decreases, as also happens in other approaches [?]. In this
situation HPF performance is quite good, so the integration of task parallelism
contributes less.



Table 2. Computational time (in milliseconds) and HPF/BCL ratio for the 2-D FFT
problem

                                     HPF vs. BCL (ratio)
Array Size  Sequential  4 Processors         8 Processors         16 Processors
32 x 32       1.507     0.947/0.595 (1.59)   0.987/0.475 (2.08)   1.601/1.092 (1.47)
64 x 64       5.165     2.189/1.995 (1.09)   1.778/1.238 (1.44)   2.003/1.095 (1.83)
128 x 128    20.536     7.238/7.010 (1.03)   5.056/4.665 (1.08)   4.565/3.647 (1.25)



4 Conclusions

BCL, a Border-based Coordination Language, has been used for the integration
of task and data parallelism. By means of some examples, we have shown the
suitability and expressiveness of the language. The clear separation of
computational and coordination aspects increases code reusability. This way, the
coordinator code can be re-used to solve other problems with the same geometry,
independently of the physics of the problem and the numerical methods
employed. On the other hand, the worker processes can also be re-used
independently of the geometry. The evaluation of an initial prototype by means
of some examples has proved the efficiency of the model. Two of them have been
presented in this paper.



Abstract. This paper describes several results of parallel and distributed
computing using a production flow solver program. A coarse-grained
parallelization based on clustering of discretization grids, combined with
partitioning of large grids for load balancing, is presented. An assessment
is given of its performance on tightly-coupled distributed and
distributed-shared memory platforms using large-scale scientific problems. An
experiment with this solver, adapted to a Wide Area Network environment, is
also presented.



1 Introduction 

Recent improvements in high-performance computing hardware have made the 
simulation of complex flow models a viable analysis tool. However, further effi- 
ciency increases are needed to integrate this tool into the design process, which 
requires large numbers of separate flow analyses. The only way in which drastic 
improvements in performance can be obtained in the short term is to utilize par- 
allel and distributed computing techniques. In recent years, significant strides 
have been made towards this goal through parallelization of flow solver programs 
of importance to NASA. The objective of this paper is to assess the performance 
of one such code, identify potential bottlenecks, and determine their impact 
in the more demanding environment of distributed computing using widely- 
separated resources. 

Two realistic cases are used to evaluate performance of the solver. They
concern viscous calculations about complex configurations. The computations are
done using the flow code OVERFLOW [?], which resolves the geometrical
complexity of solution domains by letting sets of separately generated and
updated structured discretization grids exchange information through
interpolation. We consider both single, tightly-coupled platforms, and
geographically separated machines connected using the Globus metacomputing
toolkit [?].

The remainder of this paper includes a brief overview of the numerical method
(Section 2) and parallelization strategy (Section 3) used. Section 4 describes
the parallel implementation for a geographically distributed environment, and
compares it with that used in our earlier work [1]. Results are presented in
Section 5, and summary remarks and a discussion of future directions, including
a strategy for improving the load balance, are given in Section 6.



Fig. 1. Basic partitioning strategy: (a) overset grid schematic; (b) grouping
and communication.



2 Numerical Method 

OVERFLOW [?] uses Chimera [?] overset grid systems to solve the thin-layer
Navier-Stokes equations, augmented with a turbulence model. It is widely used 
in the aerodynamics community and is the most popular flow solver in use at 
NASA Ames Research Center. It uses finite differences in space, and implicit 
time-stepping. For steady-state problems such as those studied in this paper, fast 
relaxation to the final solution is achieved by a combination of a sophisticated 
multigrid convergence acceleration scheme and a local time-stepping method, 
which updates the solution based on a spatially varying virtual time increment. 

OVERFLOW’s speed and accuracy depend on the special properties of structured
discretization grids. Such grids individually are not well suited for
geometrically complex domains, so several are used, each of which covers only
part of the domain. The resulting configuration is an overset grid system. The
solution proceeds by updating, at each time step, the inter-grid boundaries with
interpolated data from overlapping grids (Fig. 1a). The coordinates of the
Chimera interpolation points are fixed in time and can be determined prior to
the run.



3 Parallel Implementation Basics 

Overset grid systems feature at least one level of exploitable parallelism: the
solution on individual grids can be carried out independently by different
processors, as described in our previous work [?]. However, that version of the
code [?] did not allow individual grids to be distributed across several
processors, creating a load balancing problem in case of large disparities
between grid sizes. The new version [?] described here solves part of this
problem by parallelizing the solution within each grid; thus, it provides a
second level of exploitable parallelism. While this allows a better load balance
in principle, three important sources of overall load imbalance remain.



3.1 Sources of Load Imbalance 

First, all grid points are currently given equal weight in terms of associated work. 
However, some of the near-body grids require more work per point, because they 
need to solve the turbulence model in addition to the flow equations. For better 
load balance, we need to give larger weights to points inside turbulence regions. 
Second, processors may either solve part of one large grid, or one or more whole 
grids, but not both. This means that once a processor is assigned part of a large 
grid, it cannot further reduce any load imbalance by receiving more work. Third,
even if the computational work is divided evenly among the processors, load
imbalances may still result from disparities in communication volumes. The
reason for this is twofold. On the one hand, if a large grid is distributed
across multiple processors, the implicit solution process within this grid must
be parallelized, which requires a significant amount of communication (indicated
by the heavy arrows in Fig. 1b). This is in addition to any communications
required to interpolate data from different neighbor grids. On the other hand,
the grid grouping and splitting strategy currently does not explicitly take into
account the magnitude of the data volume incurred by the decomposition. In [1],
we described a way of minimizing the maximum communication volume between
processors. This strategy will have to be further refined to reflect the
possibility of individual grids being distributed. We discuss a promising new
load balancing strategy in Section 6.1. Some basic features of the parallel
strategy implemented in OVERFLOW, used for the experiments reported in this
paper, are discussed below.



3.2 Grouping Strategy and Grid Splitting 

Load balancing the parallel algorithm involves a bin-packing strategy that forms
a number of groups, each consisting of a grid and/or a cluster of grids. If no
further work division takes place, the total number of grid points per group is
limited by the memory allotted to each processor. As reported in [1], this
strategy may produce a poor load balance, depending on the total number and size
distribution of grids, and the number of processors. To alleviate this problem,
the current grouping strategy in OVERFLOW allows us to divide large grids
evenly across multiple processors while maintaining the implicitness of the
numerical scheme within the grid; however, because a processor receiving part of
a grid needs to exchange information with the other parts during the implicit
solution process, it effectively needs to execute in lockstep with the other
processors working on the grid. This has led to the requirement that a processor
receiving a partial grid not receive any other work, to avoid starvation on the
other processors working on the grid. As a consequence, each part of an
equi-partitioned large grid is in a group by itself (Fig. 1b). The total number
of processors equals
the total number of groups.

The grouping algorithm is as follows. First, specify the maximum number
$N_{\max}$ of grid points that each processor may receive. This number is often
related to the amount of local memory available on each processor. Since the
total number of points $N_{\mathrm{tot}}$ for the problem is given, we can now
determine the minimum number of processors $P_{\min}$ required, i.e.
$P_{\min} = \lceil N_{\mathrm{tot}} / N_{\max} \rceil$. Next, sort
the grids by size in descending order, and break up any grids whose size $N$
exceeds the maximum into $\lceil N / N_{\max} \rceil$ pieces. Each such piece is
assigned its own group, and hence its own processor, as indicated above. The
remaining whole grids whose sizes are below the maximum are assigned as follows.
The largest is assigned to the next available processor, until all $P_{\min}$
processors own at least one grid. Each subsequent remaining grid is assigned to
the processor that is responsible for the smallest total number of points thus
far. When no more grids can be placed without exceeding the maximum number of
points per processor, the remaining number of points is determined, and the
minimum corresponding number of additional processors is computed. This process
is repeated for assigning whole grids to processors until all can be placed.
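
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OVERFLOW implementation: the function name `group_grids` and the list-of-pairs representation are hypothetical, oversized grids are split into near-equal pieces, and a whole grid that does not fit on the least-loaded processor is deferred to the next round while smaller grids are still tried.

```python
import math


def group_grids(grid_sizes, n_max):
    """Return a list of groups; each group is a list of (grid_id, points)."""
    # Sort grids by size, largest first.
    order = sorted(enumerate(grid_sizes), key=lambda t: -t[1])

    split_groups, whole = [], []
    for gid, n in order:
        if n > n_max:
            # Oversized grid: ceil(N / N_max) near-equal pieces, one group each.
            pieces = math.ceil(n / n_max)
            base, extra = divmod(n, pieces)
            for p in range(pieces):
                split_groups.append([(gid, base + (1 if p < extra else 0))])
        else:
            whole.append((gid, n))

    # Minimum processor count for the whole problem.
    p_min = math.ceil(sum(grid_sizes) / n_max)
    n_bins = max(p_min - len(split_groups), 1) if whole else 0
    bins = [[] for _ in range(n_bins)]
    loads = [0] * n_bins

    pending = whole
    while pending:
        leftover = []
        for gid, n in pending:
            # Least-loaded processor; empty ones (load 0) are filled first.
            i = min(range(len(loads)), key=loads.__getitem__)
            if loads[i] + n <= n_max:
                bins[i].append((gid, n))
                loads[i] += n
            else:
                leftover.append((gid, n))
        if leftover:
            # Add the minimum number of processors for the remaining points.
            extra_bins = math.ceil(sum(n for _, n in leftover) / n_max)
            bins += [[] for _ in range(extra_bins)]
            loads += [0] * extra_bins
        pending = leftover
    return split_groups + bins


if __name__ == "__main__":
    for group in group_grids([10, 4, 3, 3, 2], n_max=5):
        print(group)
```

In this toy run the grid with 10 points is split into two single-piece groups, and the four smaller grids are packed onto the remaining three processors without any group exceeding 5 points.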

OVERFLOW uses explicit message passing (MPI) for data transfers, which 
is suitable for both distributed and distributed-shared memory architectures. 
With grid splitting, message passing is also required between processors working 
on the same grid (Fig. 1b). Communication between groups of a partitioned
grid follows the master-slave paradigm. One of the processors working within 
a subpartition of a grid is selected as the master of all processors in that grid. 
Upon completion of a time-step, all inter-group Chimera exchanges between the 
partitioned grid and other groups are achieved via the masters. 

Most communication in the implicit flow solution occurs during the evalua- 
tion of the nonlinear forcing term (a data parallel stencil operation), and during 
the so-called line solves. The solves, which take place in all three coordinate 
directions, feature a data dependence in the active direction, which is resolved 
using pipelining. A detailed analysis shows that an intra-grid interface point
involves the communication of 76 words per time step, whereas a Chimera
interpolation point “consumes” only 5 words (without the turbulence model). This
disparity in communication sizes per interface point is indicated in Fig. 1b by
the relative thickness of the arrows that symbolize inter-group communications.

Intra-grid communications take place through updates of overlap points [?].
While this is a well-understood process, it adds significantly to the complexity
of the implicit solution process, and requires many changes to the serial version
of the code. It is also interesting to note that the parallel update of Chimera
boundaries differs from the serial update, a fact that may impact solution
stability. The difference may be thought of as block Gauss-Seidel versus block
Jacobi iteration [3].
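
The Gauss-Seidel versus Jacobi distinction can be illustrated with a toy problem. The sketch below is a scalar analogue and an assumption of this illustration, not the OVERFLOW update: two coupled unknowns `u` and `v` stand in for two overlapping grids, with the fixed point u = v = 1. In the sequential (Gauss-Seidel-like) sweep the second update sees the freshly updated first value; in the simultaneous (Jacobi-like) sweep both updates use values from the previous iteration, as happens when grids are updated in parallel.

```python
def iterate(sequential, tol=1e-8, max_iter=1000):
    """Iterate u = (1 + v) / 2, v = (1 + u) / 2 until both are within tol of 1.

    sequential=True  : v is updated with the new u (Gauss-Seidel-like).
    sequential=False : v is updated with the old u (Jacobi-like).
    Returns (iterations, u, v).
    """
    u = v = 0.0
    for it in range(1, max_iter + 1):
        u_new = (1.0 + v) / 2.0
        v_new = (1.0 + (u_new if sequential else u)) / 2.0
        u, v = u_new, v_new
        if max(abs(u - 1.0), abs(v - 1.0)) < tol:
            return it, u, v
    return max_iter, u, v


if __name__ == "__main__":
    print("sequential  :", iterate(sequential=True)[0], "iterations")
    print("simultaneous:", iterate(sequential=False)[0], "iterations")
```

Both variants reach the same fixed point here, but the simultaneous variant needs more iterations because each update works with older neighbor data.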

4 Parallel Distributed Computing 

The distributed computing methodology used in this work is based on NASA’s
Information Power Grid (IPG) project [?]. It is one of several infrastructural
approaches to Grid computing [?] (not to be confused with computations on
discretization grids). IPG provides an environment for resource management
with the ability to unify multiple physically separated computational resources 
into a single virtual machine. Large-scale problems whose memory requirements 
exceed those of individual resources can potentially be solved by employing the 
IPG; a large data set can be decomposed into several smaller, more manageable 
sets, and then be distributed across a collection of resources, resulting in a parallel 
distributed computation. 

The parallel distributed methodology is similar to that of the parallel code. It 
uses the same grouping strategy and processor assignment, but processes are now 
distributed across distinct resources. Grid splitting is only allowed within a single 
resource to avoid voluminous communications between geographically separated 
machines. Data is transferred between processors by the MPICH-G [?] message
passing library, in conjunction with the Globus metacomputing toolkit [?].
Functionally, the entire application is run as a single message passing program,
and the application programmer need not be aware of any distinction between the
multiple machines.



5 Results 

Several experiments are presented to demonstrate the performance of OVERFLOW.
Parallel performance results on single, tightly-coupled machines, an SGI
Origin2000 (O2K) and a Cray T3E, are discussed in Section 5.1 using two
large-scale, geometrically complex, realistic test problems, consisting of grid
systems totaling 9 (MEDIUM) and 33 (LARGE) million grid points, respectively.
Parallel distributed performance on NASA IPG testbed systems (O2Ks at Ames,
Langley and Glenn Research Centers), using the MEDIUM grid case above, is
discussed in Section 5.2.



5.1 Parallel Performance 

The current version of OVERFLOW has been run on a 128-processor O2K
(R12000, 300 MHz), and on a 512-processor Cray T3E (300 MHz). The test
cases are as follows:

- MEDIUM consists of a wing-body configuration mounted on the splitter
  plate of the NASA Ames 12-foot Pressure Wind Tunnel (PWT), where internal
  flow of the tunnel and about the model has been simulated. The flow
  domain is discretized with 32 overset grids.

- LARGE consists of a complex configuration of a high-wing transport vehicle
  with nacelles and deployed flaps, discretized with 153 overset grids. No
  tunnel walls were modeled.

Fig. 2 shows an isometric view of some overset grids for a wing-body test
object mounted in the PWT. The simulation of the tunnel’s internal flow about
the model takes into account the effects of the tunnel wall and other support
equipment interferences. For all test cases, the following options were
selected: one-equation Spalart-Allmaras turbulence model, Roe upwind scheme,
ARC3D 3-factor diagonal scheme, and second and fourth order smoothing (all
common choices for practical applications).

Performance statistics for the MEDIUM test case in Table 1 consist of Walltime
(the gross execution time in sec/time step, which includes computation,
communication, and synchronization idle times), millions of floating point
operations per second (Mflops), and the average and maximum inter-group (Chimera
boundary) communication times. Data transfer times for the intra-grid
communications are not listed separately. The Mflops reported on the O2K and the
T3E are calculated relative to single-CPU Cray C90 Mflops for the same problem.

The T3E is a purely distributed-memory machine, and the size of MPI processes
run on a node is limited by the physical memory located on that node. The
smallest number of nodes on which test case MEDIUM can be run is 88, while
LARGE requires a minimum of 203 nodes. By contrast, the O2K allows MPI
processes to use as much memory as is physically available on the whole machine,
so no minimum number of processors is required to solve either test case.
However, we do try to maintain a reasonable balance between number of processors
requested and maximum amount of memory used, so that the interference with jobs
run by other users is minimized.

Table 1. Parallel performance for MEDIUM test case: 9 million grid points.

                   Walltime              Comm. time (sec/step)
Machine   CPUs    (sec/step)   Mflops      Avg       Max
O2K          4       68.         540       3.8       11.
            16       18.        2055       3.0       ?
            63        7.0       5257       1.6       3.8
           124        4.3       7667       1.1       2.7
T3E         88       17.        2127       3.5       12.
           271        7.2       5111       1.2       3.8
           370        7.9       4658       1.6       4.1
           492        6.8       5411       1.5       3.7

Fig. 2. Isometric view of wing-body overset grids in 12-foot PWT.

We conclude from Table 1 that the MEDIUM grid configuration allows reasonable
scalability up to 124 and 271 processors on the O2K and T3E, respectively.
Parallel efficiency on the O2K (based on 4 CPUs) is about 40% for processor
sets in the range of 60 to 124. The code achieves a maximum of 7600 Mflops
on the O2K with 124 CPUs, as compared with approximately 4000 Mflops on the
Cray C90 with 16 processors using the serial code with multitasking directives.
Similarly, the LARGE grid case results, shown in Table 2, indicate that this
configuration scales well up to 96 and 510 processors on the O2K and T3E,
respectively.

Table 2. Parallel performance for LARGE test case: 33 million grid points.

                   Walltime              Comm. time (sec/step)
Machine   CPUs    (sec/step)   Mflops      Avg       Max
O2K         16       52.0       2650       12.       24.6
            48       28.9       4768       5.0       12.5
            96       19.7       6994       4.5       9.30
           124       20.1       6855       5.5       10.9
T3E        203       34.2       4025       7.2       20.4
           299       31.0       4432       12.       25.5
           400       22.4       6043       5.2       13.3
           510       18.1       7613       4.8       12.2

We note that scalability on the O2K tapers off sooner for the LARGE grid system
than for the MEDIUM size configuration. This counterintuitive result is due to
the different distribution of grid sizes and inter-grid communications, and the
concomitant poorer load balance. Performance on the T3E for the MEDIUM test
case deteriorates beyond 271 CPUs due to communication overhead, in addition to
the poor load balance.



5.2 Parallel Distributed Computing Performance 



The NASA IPG testbed currently consists mostly of O2K systems, and this is the
platform we used for the parallel distributed computing experiments. This was
done to eliminate heterogeneity as a possible additional source of load
imbalance. The machines used are at NASA research centers in California (Ames),
Virginia (Langley), and Ohio (Glenn), respectively.

Table 3. Parallel distributed performance for MEDIUM test case, using
IPG-Globus.

       Number of Processors             Walltime     Comm. Time (sec/step)
Ames (CA)  Langley (VA)  Glenn (OH)    (sec/step)      Min       Max
    1           1            0            195           15        48
    2           0            0            175            1        27
    2           1            1             98            5        30
    4           0            0             91            0.7      29
    2           5            1             52            1        18
    8           0            0             43            0.3       9
    2          12            2             32            2        16
   16           0            0             23            0.1      13
    8           8            8             26            2        13
    0           0           24             15            0.2       8


190 M.J. Djomehri et al.


Since intra-grid interfaces are much more communication intensive than
inter-grid Chimera interpolation, partitioned grids are never split across
geographically separated machines. Moreover, the method for latency hiding,
described in our previous work [?], is applicable only to Chimera updates, not
to pipelining of the line solves in OVERFLOW. Performance results are summarized
in Table 3 for the MEDIUM size configuration only, because the IPG testbed
systems used are relatively small. Several runs were done on various numbers of
processors selected from each machine. Listed are the results for totals of 2,
4, 8, 16, and 24 processors on multiple resources. For comparison, we also list
single-resource results for the same number of processors. The last two columns
of Table 3 refer to the minimum and maximum communication time per step, over
all processors.

It follows that the scalability of the code on up to 24 processors is reason- 
able, although it is clear that the communication overhead becomes increasingly 
significant beyond 8 processors. The comparison of results for multiple versus 
single resources shows that both minimum and maximum communication times 
are significantly increased. While the maximum value is influenced by many
factors (including a bad load balance), the minimum is governed virtually
exclusively by the communication speed. The increase of this minimum by about
an order of magnitude on multiple, geographically separated resources points to
the necessity of latency hiding, as argued in [?], and in Section 6.2 below.
Future plans also include algorithmic enhancements for better load balancing.
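
The order-of-magnitude statement can be checked directly against the minimum communication times in Table 3. The short Python sketch below pairs, for each total processor count, the single-resource and multiple-resource rows of that table; the dictionary layout and variable names are assumptions of this illustration.

```python
import statistics

# total processors -> (min comm time on a single resource,
#                      min comm time on multiple resources), in sec/step,
# taken from the "Min" column of Table 3.
min_comm = {
    2: (1, 15),
    4: (0.7, 5),
    8: (0.3, 1),
    16: (0.1, 2),
    24: (0.2, 2),
}

# Slowdown of the minimum communication time when going to multiple resources.
ratios = {p: multi / single for p, (single, multi) in min_comm.items()}
median_ratio = statistics.median(ratios.values())

if __name__ == "__main__":
    for p, r in ratios.items():
        print(f"{p:2d} processors: x{r:.1f}")
    print(f"median: x{median_ratio:.1f}")
```

The individual ratios range from roughly 3 to 20, with a median of about 10, which is consistent with the "about an order of magnitude" reading of the table.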

6 Challenges 

The performance results discussed above demonstrate the feasibility of parallel 
and distributed computing on homogeneous IPG testbeds, although performance 
is significantly affected by an increase in communication time. In a realistic IPG 
environment, poorer connectivity and larger latencies due to geographical sepa- 
ration of the computers used, could further impact performance. Modifications 
must be made that minimize synchronization idle time and that hide latency by 
overlapping communication with computation. Currently, none of the aforemen- 
tioned techniques have been implemented in OVERFLOW. 

6.1 Improved Load Balancing Strategy 

It is evident from Tables 1, 2 and 3 that there is a significant imbalance in the
amounts of time spent on communications between the processors. The compu- 
tational load imbalance is not listed explicitly, but it is on the order of 20% or 
more for most cases. Consequently, significant room for improvement is present. 
We propose the following strategy for better balancing the load. 

Introduce the concept of effective grid points. These are weighted such that 
each effective point takes the same amount of computational work. The weight 
will be deduced from the presence or absence of a turbulence model. 

Introduce the concept of effective communication volume, to be associated
with exterior grid boundaries (related to Chimera interpolation updates), and
with internal grid boundaries (related to partitionings of individual grids and
the ensuing communications required by the implicit solution process). This
volume equals the number of interface points, times the relative weight of the
points (five for Chimera points, 76 for internal boundary points).
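
The two weighted quantities can be written down as simple functions. The following Python sketch is illustrative only: the function names are hypothetical, and the turbulence weight is a placeholder parameter because the text does not give its value. The communication weights of 5 and 76 words per point are the ones quoted above.

```python
CHIMERA_WORDS = 5      # words per Chimera interpolation point (from the text)
INTRA_GRID_WORDS = 76  # words per internal (intra-grid) boundary point (from the text)


def effective_points(n_points, n_turbulent, turbulence_weight):
    """Effective grid points: points in turbulence regions count more.

    turbulence_weight is a hypothetical parameter (> 1); the paper does not
    state its value.
    """
    return (n_points - n_turbulent) + turbulence_weight * n_turbulent


def effective_comm_volume(n_chimera_points, n_internal_points):
    """Effective communication volume in words per time step."""
    return CHIMERA_WORDS * n_chimera_points + INTRA_GRID_WORDS * n_internal_points


if __name__ == "__main__":
    print(effective_points(1000, 200, turbulence_weight=1.5))
    print(effective_comm_volume(n_chimera_points=100, n_internal_points=10))
```

With these weights, ten internal boundary points cost more communication than one hundred Chimera points, which is why the strategy tries to keep the number of grid subdivisions small.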

The first step in improving the current algorithm is to allow processors to
work on job mixes that contain both partial and whole grids. Since subdividing
single grids incurs a significant communication cost, we use the heuristic that
it is best to limit that partitioning process to the smallest number of
subdivisions possible. Hence, we use the method outlined in Section 3.2 for
distributing those grids that exceed the maximum number of points allowed per
processor, regardless of the number of effective points involved.

Subsequently, we construct a graph whose nodes consist of the numbers of
effective grid points per grid or, for distributed grids, per grid subdivision,
and whose edges consist of the effective communication volumes between grids
and/or subdivisions. This graph can then be partitioned using any of a number
of efficient graph partitioners (for example, MeTiS [?], as proposed in [?]) that
are capable of balancing total node weight per partition while minimizing total
weight of the cut edges. The only constraint is that no partition receive more
than one grid subdivision. However, the way that subdivisions are created
guarantees that no more than one of them will fit on any processor anyway. The
reason why processors should not receive more than one partial grid is that all
processors that cooperate on a particular grid need to synchronize. If some
participate in multiple such synchronized operations, it quickly becomes
impossible to balance the load. But if they only participate in one during each
time step, this can be scheduled as the first computational task, thus avoiding
a synchronization penalty.
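
To make the two objectives of the partitioner concrete, the sketch below evaluates a given partition of a small weighted graph. It does not call MeTiS or any other partitioning library; the function name `evaluate_partition`, the dictionary-based graph and the example weights are all assumptions of this illustration.

```python
def evaluate_partition(node_weight, edge_weight, part):
    """Return (load per partition, total weight of cut edges).

    node_weight: {node: effective grid points}
    edge_weight: {(u, v): effective communication volume}
    part:        {node: partition index}
    """
    loads = {}
    for node, weight in node_weight.items():
        loads[part[node]] = loads.get(part[node], 0) + weight
    # An edge is cut when its endpoints lie in different partitions.
    cut = sum(w for (u, v), w in edge_weight.items() if part[u] != part[v])
    return loads, cut


if __name__ == "__main__":
    # Three grids (or grid subdivisions) with hypothetical weights.
    nodes = {"A": 100, "B": 60, "C": 40}
    edges = {("A", "B"): 500, ("B", "C"): 760, ("A", "C"): 50}
    print(evaluate_partition(nodes, edges, {"A": 0, "B": 1, "C": 1}))
    print(evaluate_partition(nodes, edges, {"A": 0, "B": 0, "C": 1}))
```

In this example the first assignment balances the load (100 points on each processor) and keeps the heavy B-C edge inside one partition, while the second is both unbalanced and cuts that edge; a graph partitioner searches for assignments of the first kind.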

When the partitioning is complete, several processors will generally be over- 
subscribed, having received too many grid points. The number of excess points 
is totaled, and a new number of processors is extrapolated from it, after which 
the assignment process is repeated until no processor exceeds the maximum
allowable number of points. Implementation of this strategy requires little
coding.

Extension of this method to a widely distributed computing environment is 
relatively simple; the grouping strategy takes place in two stages. The first stage 
only assigns collections of whole grids to individual, geographically separated 
platforms, whose computational resources are listed as aggregate quantities. 
Again, we use a graph partitioner to balance computational loads and minimize
communication volumes. The second stage balances the load within each
platform, potentially partitioning large grids, using the method described
above.

6.2 Latency Tolerance 

The second ingredient for improving performance of distributed computing
consists of techniques that hide communication. One such approach was
implemented and tested in our previous work [?]. In this approach, named the
Deferred Strategy, the numerical time-advancement procedures of the solution
scheme were altered. In the original synchronous time-stepping method within
OVERFLOW, the Chimera boundary data is exchanged and updated prior to the start
of each time step. In the deferred, asynchronous scheme, the Chimera boundary
data exchange is initiated at the beginning of each time step, but completion of
the exchange and the subsequent boundary value update is postponed to the end
of the step, thus creating substantial opportunity for overlapping communication
with computation.
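
The control flow of the two schemes can be contrasted in a small Python sketch. This is an assumption-laden illustration, not the OVERFLOW or MPI code: a worker thread from the standard `concurrent.futures` module stands in for asynchronous communication, and `exchange` and `advance` are hypothetical callbacks for the boundary exchange and the time-step computation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_synchronous(steps, state, exchange, advance):
    """Original scheme: the exchange completes before each step starts."""
    for _ in range(steps):
        boundary = exchange(state)
        state = advance(state, boundary)
    return state


def run_deferred(steps, state, boundary, exchange, advance):
    """Deferred scheme: start the exchange, compute, then complete it."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(steps):
            pending = pool.submit(exchange, state)  # initiated at start of step
            state = advance(state, boundary)        # uses last completed boundary
            boundary = pending.result()             # completed at end of step
    return state, boundary


if __name__ == "__main__":
    used_sync, used_deferred = [], []

    def exchange(s):
        return 10 * s  # toy "boundary data" derived from the current state

    def make_advance(log):
        def advance(s, b):
            log.append(b)  # record which boundary value this step used
            return s + 1
        return advance

    run_synchronous(3, 0, exchange, make_advance(used_sync))
    run_deferred(3, 0, 0, exchange, make_advance(used_deferred))
    print("synchronous:", used_sync)
    print("deferred   :", used_deferred)
```

In the deferred variant each step computes with boundary data obtained during the previous step, so the boundary values are one step old; in exchange, the computation no longer has to wait for the communication to finish before it starts.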

In the deferred scheme, the Chimera updates lag one time step behind as 
compared with the original time-stepping method, introducing further explicit- 
ness (in a numerical sense) into the iteration procedure that might affect stabil- 
ity. Experiments conducted with this approach, however, show no convergence 
degradation for a rather large-scale, steady-state application. Unfortunately,
neither do they show improvement in performance on two geographically separated
O2K machines (total of up to eight processors). The reason is that the
asynchronous algorithm was not supported by asynchronous hardware. A successful
approach would require one or more co-processors allocated on each separate
machine for data exchange alone, freeing other processors for computation. A
software emulation of this concept, MPIHIDE [?], was initiated as a research
project at the time of this work at Argonne National Laboratory, but has not yet
been tested in conjunction with OVERFLOW. It is a subject of future research
in this area.



Abstract. In this paper, we present a parallel algorithm for solving the
congruent region problem of locating all the regions congruent to a test
region in a planar figure on a mesh-connected computer (MCC). Given
a test region with k edges and a planar figure with n edges, it can be
executed in $O(\sqrt{n})$ time if each edge in the test region has unique
length; otherwise in $O(k\sqrt{n})$ time on MCC with n processing elements
(PEs), and in $O(\sqrt{n})$ time for both cases using kn PEs, which is optimal
on MCC within a constant factor. We shall show that this can be achieved
by deriving a new property for checking congruency between two regions
which can be implemented efficiently using RAR and RAW operations.
We also show that our parallel algorithm can be directly used to solve
point set pattern matching by simple reduction to the congruent region
problem, and it can be generally implemented on other distributed memory
models.



1 Introduction 

A mesh-connected computer (MCC) consists of several processing elements (PEs),
each of which is connected to its four neighbors. Due to its architectural
simplicity and the ease of exploiting massive parallelism, MCC has long been
used as a model of parallel computation in diverse application areas. In this
paper, we deal with the congruent region problem (CRP) on MCC.

Given a planar figure G and a test region R, CRP is to locate all the regions
in G which are congruent to R. (See Figure 1.) The problem arises frequently
in several fields such as pattern matching, computer vision, and CAD
systems. There have been many sequential algorithms related to congruity
problems. However, there are very few parallel algorithms for solving CRP.
Shih et al. gave an $O(n^2)$ parallel algorithm on a linearly-connected computer
(LCC) with k PEs and $O(kn)$ memory, where n and k are the numbers of
edges in G and R respectively, and based on their algorithm Boxer presented
$O(n \log n)$ and $O(\log n)$ parallel algorithms on the CREW model with k and kn PEs
respectively using $O(kn)$ space. In this paper we present a parallel algorithm
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Fig. 1 . Congruent region problem(CRP) 



which can be executed in $O(\sqrt{n})$ time using $O(n)$ space if each edge in the test
region has unique length; otherwise in $O(k\sqrt{n})$ time using $O(kn)$ space on an
MCC with n PEs, and in $O(\sqrt{n})$ time using $O(kn)$ space for both cases on an MCC
with kn PEs, which is optimal within a constant factor. As far as we know, this is
the first parallel algorithm for CRP proposed on MCC. The previous works on
LCC and CREW check whether two regions are congruent using the relation of
two adjacent edges in G and R. However, with such a method, they can hardly be
implemented with the same time complexity as ours on MCC even when using
the same number of PEs as ours. In our parallel algorithm, we derive a new
property which checks the congruence relation using the relative position of each
edge with respect to one distinct edge in R. This property enables us to design
the parallel algorithm with the proposed time complexity on MCC by using
RAR (random access read) and RAW (random access write) operations. Given a
point set S of n points and a pattern set R of k points, point set pattern matching
(PSPM) is to find all subsets $P \subset S$ such that R and P are congruent. PSPM
is similar to CRP, but differs in that it deals with point sets instead of edge sets
in a planar figure. We shall show that our parallel algorithm can be directly used
to solve PSPM by reducing it to the proper congruent region problem. The
previous sequential or parallel algorithms [3,4] for PSPM cannot be implemented
with the same time complexity on MCC.

The outline of our paper is as follows: In section 2 we explain basic operations, 
some definitions and property of the congruent region. In section 3 we present 
parallel algorithms for solving CRP and in section 4 show how they can be used 
for solving PSPM. In section 5 we give a conclusion. 

2 Preliminaries 

In this section we shall explain basic operations on MCC, define the congruent 
region problem more formally by introducing some notations and terminologies, 
and then describe an important property which shall be used for the efficient 
computation of congruent regions in our algorithm. 
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Fig. 2. Illustration of Definitions 



We shall use RAR and RAW as the basic operations on an MCC arranged in a
$\sqrt{n} \times \sqrt{n}$ array. In a RAR, each PE specifies the key of the data it wishes to receive,
or it specifies a null key, in which case it receives nothing. Each PE reads the data
from the PE which contains its specified key. Several PEs can request the same
key and read the same data. In a RAW, each PE specifies the key of data and
sends its data to the PE which contains its specified key. If two or more PEs
send data with the same specified key, then the PE with that key will receive the
value obtained by summing all the data sent to it. RAR and RAW take $O(\sqrt{n})$
time on MCC. On MCC, $O(\sqrt{n})$ time complexity is optimal within a constant
factor.
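
The semantics of the two operations can be illustrated with a small sequential
sketch. It only models what each PE ends up receiving, not the $O(\sqrt{n})$ routing
on the mesh; the function names `rar` and `raw` and the dictionary/list
representation of PE memory are assumptions made for illustration.

```python
def rar(store, requests):
    """Random access read (sequential model).

    store    -- dict mapping each key to the data held by the PE owning that key
    requests -- one entry per PE: the key it wants to read, or None for a null key
    Returns the list of values received, one per PE (None if nothing is received).
    """
    return [None if key is None else store.get(key) for key in requests]


def raw(owned_keys, sends):
    """Random access write with summation of concurrent writes (sequential model).

    owned_keys -- set of keys held by some PE
    sends      -- one entry per PE: a (key, value) pair, or None if it sends nothing
    Returns a dict mapping each addressed key to the sum of the values sent to it.
    """
    received = {}
    for item in sends:
        if item is None:
            continue
        key, value = item
        if key in owned_keys:
            # Concurrent writes to the same key are combined by addition.
            received[key] = received.get(key, 0) + value
    return received
```

For instance, `raw({"a", "b"}, [("a", 1), ("a", 1), None, ("b", 1)])` yields
`{"a": 2, "b": 1}`; this summation behaviour is what a counting step can rely on.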

Let G be a graph with a vertex set V and an edge set E. A graph G is called
a planar figure if it is planar and all the edges in E are straight lines.
Definition 1: Let $e_i.v_1$ and $e_i.v_2$ be the end vertices of an edge $e_i$. A region P
is a cyclic chain of directed edges $\{e_0, e_1, \ldots, e_{k-1}\}$ such that $e_i.v_2 = e_{i+1}.v_1$ for all
i, $0 \le i < k$, and each edge $e_i$ is a straight line directed from $e_i.v_1$ to $e_i.v_2$ so that
the interior of the region lies to the right side of each edge in P. Let $e_i.length$
and $e_i.angle$ denote the length of $e_i$ and the interior angle between $e_i$ and $e_{i+1}$
respectively. (See Figure 2a.)

We assume that all the subscripts are modulo k. Let $S = \{s_0, s_1, \ldots, s_{k-1}\}$ and
$R = \{r_0, r_1, \ldots, r_{k-1}\}$ be the regions in the plane.

Definition 2: A pair of edges $(s_i, s_{i+1})$ in S is said to be e-congruent to a pair of
edges $(r_j, r_{j+1})$ in R for i and j, $0 \le i, j < k$, if and only if $s_i.length = r_j.length$,
$s_{i+1}.length = r_{j+1}.length$, and $s_i.angle = r_j.angle$. S is said to be r-congruent to
R if there exists i, $0 \le i < k$, such that $(s_{i+t}, s_{i+t+1})$ is e-congruent to $(r_t, r_{t+1})$
for all t, $0 \le t < k$.
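
Definition 2 amounts to a cyclic-shift comparison, which the following sequential
sketch makes concrete. Each region is assumed to be given as a list of
`(length, angle)` pairs in cyclic order; this representation, the tolerance
`tol`, and the function names are illustrative assumptions, and the sketch is
not the parallel algorithm of this paper.

```python
def e_congruent(S, i, R, j, tol=1e-9):
    """True if the edge pair (s_i, s_{i+1}) is e-congruent to (r_j, r_{j+1}).

    S and R are lists of (length, angle) pairs; subscripts are taken modulo k.
    """
    k = len(R)
    s_len, s_ang = S[i % k]
    s_next_len, _ = S[(i + 1) % k]
    r_len, r_ang = R[j % k]
    r_next_len, _ = R[(j + 1) % k]
    return (abs(s_len - r_len) <= tol
            and abs(s_next_len - r_next_len) <= tol
            and abs(s_ang - r_ang) <= tol)


def r_congruent_shifts(S, R, tol=1e-9):
    """Return every shift i for which (s_{i+t}, s_{i+t+1}) is e-congruent
    to (r_t, r_{t+1}) for all t; S is r-congruent to R iff the list is non-empty."""
    k = len(R)
    if len(S) != k:
        return []
    return [i for i in range(k)
            if all(e_congruent(S, i + t, R, t, tol) for t in range(k))]
```

With a 3-4-5 right triangle `R = [(3.0, 90.0), (4.0, 36.87), (5.0, 53.13)]` and
the same triangle listed from its second edge, the only matching shift is 2, so
edge $s_2$ plays the role of $r_0$.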

Therefore, we can easily see that S is r-congruent to R if S can coincide 
with R by proper geometrical translation, rotation and reflection. CRP can be 
described formally using the above definition as follows: Given a planar figure 
G and a test region R, CRP is to find all the regions of G each of which is
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Fig. 3. Illustration of Lemma 1 



r-congruent to R. From now on we shall refer to the region which is r-congruent 
to R simply by the congruent region. 

Definition 3: If S is r-congruent to R, each edge $s_i$ in S is in one-to-one
correspondence with an edge $r_j$ in R such that $(s_i, s_{i+1})$ is e-congruent to $(r_j, r_{j+1})$,
and $s_i$ is said to correspond to $r_j$. We define the starting edge of S to be the one
which corresponds to $r_0$.

In the following we derive another condition for S to be r-congruent to R,
which can be used more efficiently for the computation of congruent regions on
MCC.

Definition 4: For two edges $e_i$ and $e_j$ in a region $P = \{e_0, e_1, \ldots, e_{k-1}\}$, the line
connecting $e_i.v_2$ and $e_j.v_1$ is denoted by $\ell(e_i, e_j)$ and its length by $\ell(e_i, e_j).length$,
and the interior angle between $e_i$ (resp. $e_j$) and $\ell(e_i, e_j)$ by $\alpha(e_i, e_j)$ (resp.
$\beta(e_i, e_j)$). (See Figure 2b.) If $e_j$ is identical to $e_{i+1}$, then $\alpha(e_i, e_j)$ and $\beta(e_i, e_j)$
are both equal to $e_i.angle$. Then, the relative position of $e_j$ with respect to $e_i$
is defined by a record $RP(e_i, e_j) = (e_i.length, e_j.length, \alpha(e_i, e_j), \beta(e_i, e_j),
\ell(e_i, e_j).length)$.

Definition 5: A pair of edges $(s_i, s_{i+t})$ in S is said to be relative e-congruent
to a pair of edges $(r_0, r_t)$ in R for t, $0 < t < k$, if $RP(s_i, s_{i+t}) = RP(r_0, r_t)$.
Lemma 1: S is r-congruent to R if and only if there exists i, $0 \le i < k$, such that
$(s_i, s_{i+t})$ is relative e-congruent to $(r_0, r_t)$ for all t, $1 \le t < k$.

Proof: Suppose that $(s_i, s_{i+t-1})$ and $(s_i, s_{i+t})$ are relative e-congruent to
$(r_0, r_{t-1})$ and $(r_0, r_t)$ respectively for $1 < t < k$. We want to show that
$(s_{i+t-1}, s_{i+t})$ is e-congruent to $(r_{t-1}, r_t)$. Since $s_{i+t-1}.length$ and $s_{i+t}.length$
are equal to $r_{t-1}.length$ and $r_t.length$ respectively, all we have to prove is
that $s_{i+t-1}.angle$ is equal to $r_{t-1}.angle$. Let T (resp. T') be a triangle
determined by the vertices a, b, c (resp. a', b', c'), where a, b, c (resp. a', b', c') are
$s_i.v_2$, $s_{i+t-1}.v_1$, $s_{i+t}.v_1$ (resp. $r_0.v_2$, $r_{t-1}.v_1$, $r_t.v_1$) respectively. (See Figure 3.)
Since $\ell(s_i, s_{i+t-1}).length = \ell(r_0, r_{t-1}).length$, $s_{i+t-1}.length = r_{t-1}.length$, and
$\beta(s_i, s_{i+t-1}) = \beta(r_0, r_{t-1})$, T is congruent to T'. Therefore, $\angle bca = \angle b'c'a'$, and
hence $s_{i+t-1}.angle = r_{t-1}.angle$, since $\beta(s_i, s_{i+t}) = \beta(r_0, r_t)$. We shall prove
the other direction by induction. Suppose that $(s_{i+t}, s_{i+t+1})$ is e-congruent to
$(r_t, r_{t+1})$ for all t, $0 \le t < k$, without loss of generality. Clearly, $(s_i, s_{i+1})$ is
relative e-congruent to $(r_0, r_1)$. Suppose that $(s_i, s_{i+t-1})$ is relative e-congruent
to $(r_0, r_{t-1})$. We shall show that $(s_i, s_{i+t})$ is relative e-congruent to $(r_0, r_t)$.
Since S is r-congruent to R, $s_i.length = r_0.length$ and $s_{i+t}.length = r_t.length$.
Since $\ell(s_i, s_{i+t-1}).length = \ell(r_0, r_{t-1}).length$, $s_{i+t-1}.length = r_{t-1}.length$, and
$\beta(s_i, s_{i+t-1}) = \beta(r_0, r_{t-1})$, T is congruent to T'. Therefore, $\ell(s_i, s_{i+t}).length
= \ell(r_0, r_t).length$, $\angle bac = \angle b'a'c'$ and $\angle bca = \angle b'c'a'$, and hence $\alpha(s_i, s_{i+t})
= \alpha(r_0, r_t)$ and $\beta(s_i, s_{i+t}) = \beta(r_0, r_t)$, since $\alpha(s_i, s_{i+t-1}) = \alpha(r_0, r_{t-1})$ and
$s_{i+t-1}.angle = r_{t-1}.angle$, and it follows that $(s_i, s_{i+t})$ is relative e-congruent
to $(r_0, r_t)$. ∎
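
The condition of Lemma 1 can be checked numerically with the sketch below. It
is a sequential illustration under several assumptions that are not part of the
paper: a region is given by its vertex coordinates listed clockwise (so the
interior lies to the right of each edge), edge $e_i$ runs from vertex i to vertex
i+1, angles are measured as counter-clockwise turns in $[0, 2\pi)$, and records are
compared with a tolerance `tol`. Reflection is not handled here; the algorithm in
Section 3 covers it by duplicating every edge in both directions.

```python
import math


def _sub(p, q):
    """Vector from point q to point p."""
    return (p[0] - q[0], p[1] - q[1])


def _angle(u, v):
    """Counter-clockwise angle from vector u to vector v, in [0, 2*pi)."""
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return math.atan2(cross, dot) % (2 * math.pi)


def relative_position(verts, i, j):
    """Record RP(e_i, e_j) = (len_i, len_j, alpha, beta, connecting length)."""
    k = len(verts)
    a1, a2 = verts[i % k], verts[(i + 1) % k]  # e_i.v1, e_i.v2
    b1, b2 = verts[j % k], verts[(j + 1) % k]  # e_j.v1, e_j.v2
    len_i, len_j = math.dist(a1, a2), math.dist(b1, b2)
    if (j - i) % k == 1:
        # e_j is e_{i+1}: alpha and beta both equal the angle between e_i and e_j.
        ang = _angle(_sub(a1, a2), _sub(b2, b1))
        return (len_i, len_j, ang, ang, 0.0)
    alpha = _angle(_sub(a1, a2), _sub(b1, a2))  # between e_i and the connecting line
    beta = _angle(_sub(a2, b1), _sub(b2, b1))   # between the connecting line and e_j
    return (len_i, len_j, alpha, beta, math.dist(a2, b1))


def starting_edges(S, R, tol=1e-9):
    """Indices i such that (s_i, s_{i+t}) is relative e-congruent to (r_0, r_t)
    for all t with 1 <= t < k."""
    k = len(R)
    if len(S) != k:
        return []

    def same(p, q):
        return all(abs(x - y) <= tol for x, y in zip(p, q))

    return [i for i in range(k)
            if all(same(relative_position(S, i, i + t), relative_position(R, 0, t))
                   for t in range(1, k))]
```

Note that every record compared here is taken relative to one fixed edge ($s_i$ or
$r_0$), which is what allows each edge to be tested independently of its neighbors.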



3 Parallel Algorithm 

In this section we shall explain how to find all the regions in a planar figure which
are r-congruent to a test region. First we shall describe a parallel algorithm for
the unique case in which each edge in the test region has distinct length, and then
for the general case with no restriction, using Lemma 1. Each edge e in a
planar figure G is duplicated into one directed from $e.v_1$ to $e.v_2$ and the other
from $e.v_2$ to $e.v_1$ in order to take care of the geometrical reflection with respect
to e. Let $E = \{e_0, e_1, \ldots, e_{n-1}\}$ be the set of all the duplicated edges in G and let
$R = \{r_0, r_1, \ldots, r_{k-1}\}$ be a test region. Initially, all the edges in E and R are evenly
distributed in local memory on MCC. Each congruent region S is represented
by the index of the starting edge in S which corresponds to $r_0$.

1). Unique Case: For each edge $e_i$ in E, the edge $r_{i_c}$ in R with the same
length as $e_i$ is unique, if it exists. The detailed description of the algorithm is given
below.

Parallel Algorithm Congruent-Region-I 

Input: A set $E = \{e_0, e_1, \ldots, e_{n-1}\}$ of edges for a planar figure G and a set
$R = \{r_0, r_1, \ldots, r_{k-1}\}$ of edges for a test region such that the length of each
edge in R is unique.

Output: Each edge $e_i$ in a congruent region S sets $e_i.flag$ to true and stores
into $e_i.reg$ and $e_i.corr$ respectively the index of the starting edge of S which
corresponds to $r_0$ and the index of the edge in R to which $e_i$ corresponds.

1). Find, for each edge $r_i$, its relative position $RP(r_0, r_i)$ with respect to $r_0$,
and store it into $r_i.rel$.

2). Find, for each edge $e_i$ in E, an edge $r_{i_c}$ in R with the same length as $e_i$ if
it exists, and store the index $i_c$ of $r_{i_c}$ and the relative position $r_{i_c}.rel$ into
$e_i.corr$ and $e_i.rel$ respectively, and set $e_i.flag$ to true. This can be done
using RAR with the length as key. Note that $e_i$ should correspond to $r_{i_c}$ if
there exists a congruent region to which $e_i$ belongs.

3). Find, for each edge $e_i$ with $e_i.flag$ set to true, an edge $e_{i_0}$ in G such that
$(e_{i_0}, e_i)$ is relative e-congruent to $(r_0, r_{i_c})$. If there exists such $e_{i_0}$, store $i_0$ to
$e_i.reg$; otherwise set $e_i.flag$ to false. This can be done by first determining,
for each edge $e_i$, the two end vertices of $e_{i_0}$ from $e_i$ and $e_i.rel$, and then
finding $e_{i_0}$ in G by executing RAR with those two end vertices as key.

Comment: At step 3) we have computed, for each edge $e_i$ with the same length
as $r_{i_c}$, the edge $e_{i_0}$ which may be the starting edge of the congruent region where $e_i$
corresponds to $r_{i_c}$. However, we have not yet determined whether there exists
a congruent region with $e_{i_0}$ as the starting edge. The following lemma is
used to check whether there exists a congruent region with $e_{i_0}$ as the starting
edge.

Lemma 2: For a set S of k edges, S is congruent to R if and only if $e_{i_0}$ is
identical for every edge $e_i$ in S.

Proof: Suppose that $e_{i_0}$ is identical for every edge $e_i$ in S. For two distinct edges
$e_i$ and $e_j$ in S, $r_{i_c}$ should be different from $r_{j_c}$, since if $r_{i_c}$ is identical to $r_{j_c}$,
then $(e_{i_0}, e_i)$ should be relative e-congruent to $(e_{i_0}, e_j)$, which is a contradiction.
Therefore, for each edge $e_i$ in S there must exist a unique $r_{i_c}$ in R such that
$(e_{i_0}, e_i)$ is relative e-congruent to $(r_0, r_{i_c})$, and hence S is r-congruent to R by
Lemma 1. The other direction follows directly from Lemma 1. ∎

Therefore, we can determine, for each edge $e_p$ in G, whether there exists a
congruent region with $e_p$ as the starting edge by checking whether there is a set
S of k edges such that for each edge $e_i$ in S, $e_{i_0}$ is $e_p$. Using this property we
can find all the edges in E which are the starting edges of the congruent regions
in G, and hence we can determine, for each edge $e_i$, whether it belongs to a
congruent region or not by checking whether $e_{i_0}$ is a starting edge.

4). Check, for each edge in E with its flag set to true, whether it may be the
starting edge of a congruent region. This can be done as follows: Send, for
each edge $e_i$ with $e_i.flag$ set to true, 1 to the PE containing $e_{i_0}.sum$ using
RAW. Then $e_{i_0}.sum$ stores the sum of all the 1's sent to it, and if it is k,
there exists a congruent region with $e_{i_0}$ as the starting edge by Lemma 2,
and $e_{i_0}.flag$ is set to true; otherwise it is set to false.

5). Check, for each edge $e_i$ with $e_i.flag$ set to true, whether it may belong to
the congruent region with $e_{i_0}$ as the starting edge. Since we have determined
whether there exists a congruent region with $e_{i_0}$ as the starting edge at step
4), this step can be done simply by checking whether $e_{i_0}.flag$ is true or not
using RAR. If $e_{i_0}.flag$ is not true, there does not exist a congruent
region with $e_{i_0}$ as the starting edge, and $e_i.flag$ is set to false. After this step,
all the edges in the congruent regions have their flags set to true, and each edge
$e_i$ in the congruent region S stores into $e_i.reg$ and $e_i.corr$ respectively the
index of the starting edge in S and the index of the edge in R to which $e_i$
corresponds.

End Algorithm. 
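
Steps 4) and 5) reduce to a vote count followed by a lookup, which the sequential
sketch below mirrors. It assumes the candidate starting-edge indices from step 3)
are already available as a list `reg` (with `None` where the flag is false); the
function name and this representation are hypothetical, and the sketch keeps the
set of confirmed starting edges separate from the per-edge flags for clarity.

```python
from collections import Counter


def confirm_regions(reg, k):
    """Sequential model of steps 4) and 5).

    reg -- reg[i] is the candidate starting-edge index i_0 of edge e_i,
           or None if e_i.flag is false after step 3)
    k   -- number of edges in the test region R
    Returns (starts, flags): the confirmed starting-edge indices and, for every
    edge, whether it belongs to a congruent region.
    """
    # Step 4): every flagged edge sends 1 to its candidate i_0 (RAW with summation).
    votes = Counter(i0 for i0 in reg if i0 is not None)
    # By Lemma 2, i_0 starts a congruent region exactly when it receives k votes.
    starts = {i0 for i0, count in votes.items() if count == k}
    # Step 5): every edge reads the result for its own candidate (RAR).
    flags = [i0 in starts for i0 in reg]
    return starts, flags
```

For example, with `k = 3` and `reg = [0, 0, 0, None, 5, 5]`, edge 0 collects three
votes and is confirmed, while candidate 5 collects only two and its edges are
cleared.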

The correctness of the parallel algorithm follows from the discussion in the 
algorithm, and the following theorem holds. 

Theorem 1: Given a planar figure G with n edges and R with k edges, all the
congruent regions can be computed in $O(\sqrt{n})$ time on MCC using n PEs and
$O(n)$ space.

Proof: Step 1) takes $O(\sqrt{n})$ time for distribution of $r_0$. Steps 2) and 3) can each be
done in $O(\sqrt{n})$ time using RAR. Step 4) can be executed in $O(\sqrt{n})$ time
using RAW and step 5) in $O(\sqrt{n})$ time using RAR. Therefore, the overall time complexity
is $O(\sqrt{n})$. Since we need $O(1)$ space for each edge in E and R, the overall space
complexity is $O(n)$. ∎
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2). General Case: Each edge e* in E may belong to more than one congruent 
regions and correspond to different edges of R in each congruent region to which
it belongs. Therefore, in the general case we need to check, for each edge $e_i$,
whether there is a congruent region such that $e_i$ corresponds to $r_j$, for j =
0, 1, 2, ..., k-1. At the jth iteration, the computation of a congruent region where
$e_i$ corresponds to $r_j$ is carried out similarly as in the unique case. The detailed
description of the parallel algorithm is given below.

Parallel Algorithm Congruent-Region 

Input: A set $E = \{e_0, e_1, \ldots, e_{n-1}\}$ of duplicated edges for a planar figure G
and a set $R = \{r_0, r_1, \ldots, r_{k-1}\}$ of edges for a test region.

Output: Each edge $e_i$ in E is associated with an array $A_i$ of size k, each
component of which consists of three fields: flag, reg, and rel. If $A_i[j].flag$ is set
to true, $e_i$ belongs to the congruent region S where $e_i$ corresponds to $r_j$,
and $A_i[j].reg$ stores the index of the starting edge of S. $A_i[j].rel$ stores the
relative position $RP(r_0, r_j)$.

Comment: For each edge $e_i$ in E there may exist more than one edge of R
whose length is equal to that of $e_i$. Therefore, $e_i$ may correspond to more than
one edge of R in different congruent regions. In order to distinguish the
congruent regions to which $e_i$ belongs but in which it corresponds to different edges of R,
$e_i$ is associated with an array $A_i$ of size k such that $A_i[j].flag$ is true if $e_i$
belongs to a congruent region where $e_i$ corresponds to $r_j$, and $A_i[j].reg$ stores
the index of the starting edge of the congruent region if it exists.

1). Check whether each edge in R has unique length or not. If so, find all the
congruent regions using the parallel algorithm Congruent-Region-I.

2). For each edge $e_i$ do the following for j = 0, 1, ..., k-1.

2.1). Find the relative position $RP(r_0, r_j)$ and store it into $A_i[j].rel$.

2.2). Set $A_i[j].flag$ to true if $e_i.length$ is equal to $r_j.length$.

3). For each edge $e_i$ with $A_i[j].flag$ set to true do the following for j =
0, 1, ..., k-1.

3.1). Find an edge $e_{i_0}$ in E such that $(e_{i_0}, e_i)$ is relative e-congruent to $(r_0, r_j)$
if it exists. This can be done similarly as in step 3 of the previous algorithm
for the unique case. If there exists such $e_{i_0}$, set $A_i[j].flag$ to true, and store
$i_0$ to $A_i[j].reg$.

Comment: At the jth iteration of step 3.1 we have computed, for each edge
$e_i$, the edge $e_{i_0}$ which may be the starting edge of the congruent region where $e_i$
corresponds to $r_j$. However, we do not yet know whether there exists a
congruent region with $e_{i_0}$ as the starting edge. In the next step we check, for
each edge $e_{i_0}$ in E, whether there exists a congruent region with it as the
starting edge.

3.2). Send 1 to the PE containing $e_{i_0}.sum$ using RAW, where $i_0 = A_i[j].reg$.
Then each PE containing $e_{i_0}.sum$ receives 1 and adds it to $e_{i_0}.sum$.

4). After step 3) $e_{i_0}.sum$ in each PE stores the sum of all the 1's sent to it. If
the sum is k, there exists a congruent region with $e_{i_0}$ as the starting edge
by Lemma 2, and $A_{i_0}[0].flag$ is set to true; otherwise it is set to false.

Comment: If $A_{i_0}[0].flag$ is set to true, then there exists a congruent region
S with $e_{i_0}$ as the starting edge, and every edge $e_i$ with $A_i[j].reg$ equal to $i_0$




Parallel Congruent Regions on a Mesh-Connected Computer 201 



becomes an edge of S which corresponds to $r_j$. Therefore, we can identify
all the edges in the congruent regions as follows.

5). For each edge $e_i$ with $A_i[j].flag$ set to true do the following for j =
0, 1, ..., k-1: Check whether $A_{i_0}[0].flag$ is true or not by RAR, where $i_0
= A_i[j].reg$. If it is true, $e_i$ is an edge of the congruent region with $e_{i_0}$ as
the starting edge; otherwise set $A_i[j].flag$ to false. After this step, if $e_i$
corresponds to $r_j$ in the congruent region with the starting edge $e_{i_0}$, $A_i[j].flag$
is set to true and $A_i[j].reg$ stores the index $i_0$ of the starting edge of S.
End Algorithm.

Theorem 2: Given a planar figure G with n edges and R with k edges, all the
congruent regions can be computed in $O(k\sqrt{n})$ time on MCC using n PEs, and
in $O(\sqrt{n})$ time using kn PEs and $O(kn)$ space.

Proof: The correctness of the parallel algorithm follows directly from the
discussion given in the algorithm.

4 Point Set Pattern Matching 

In this section we shall show that the parallel algorithm for finding congruent
regions can be used in solving PSPM by transforming it to the proper CRP. Given
a point set $S = \{s_0, s_1, \ldots, s_{n-1}\}$ of n points and a pattern set $R = \{r_0, r_1, \ldots, r_{k-1}\}$
of k points in Euclidean 2D space, PSPM is to find all subsets $P \subset S$ such that
R and P are congruent. The parallel algorithm works by executing, for each $s_j$
in S, a subprocedure Pattern-Matching for finding all subsets $P \subset S$ such that R
and P are congruent while anchoring $r_0$ at $s_j$. (See Figure 4.) The subprocedure is
essentially identical to the parallel algorithm Congruent-Region described in the
previous section. It can be described briefly as follows: Let $P = \{p_0, p_1, \ldots, p_{k-1}\}$
be a subset of S with $p_0$ identical to $s_j$. Figure 4a illustrates five points $r_0 \sim r_4$
in a pattern set R, and Figures 4b and 4c respectively show eight points $s_0 \sim s_7$
in S and five points $p_0 \sim p_4$ in $P \subset S$ congruent to R while anchoring $r_0$ at $s_5$.
Suppose each point $r_i$ (resp. $p_i$) in R (resp. P) is connected to $r_0$ (resp. $p_0$) by
an edge $e_i$ (resp. $d_i$). (See Figure 5.) For explanation purposes, we assume that
all the points in R and S are sorted in circular order. (In the real algorithm, the
circular sorting is not necessary, since we use the relative position of each edge.)
For each edge $e_i$, its relative position $RP(e_i, e_j)$ with respect to $e_j$ is defined
by a record $(e_i.length, e_j.length, \alpha(e_i, e_j))$, where $\alpha(e_i, e_j)$ is the interior angle
between $e_i$ and $e_j$. Similarly, for each edge $d_i$, its relative position $RP(d_i, d_j)$
with respect to $d_j$ is defined by a record $(d_i.length, d_j.length, \alpha(d_i, d_j))$, where
$\alpha(d_i, d_j)$ is the interior angle between $d_i$ and $d_j$. We assume that for subscript
i, i + t is (i + t modulo k) + 1 if $i + t \ge k$.

Definition 6: A pair of points $(p_i, p_j)$ in P is said to be relative p-congruent to
a pair of points $(r_a, r_b)$ if $RP(d_i, d_j) = RP(e_a, e_b)$.

Definition 7: P is said to be m-congruent to R if and only if there exists i,
$1 \le i < k$, such that $(p_{t+i}, p_i)$ is relative p-congruent to $(r_{t+1}, r_1)$ for all t,
$1 \le t < k - 1$.

We can easily see that P is m-congruent to R if P coincides with R by proper 
geometrical translation, rotation and reflection, since the relative position of two 
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Fig. 5. Illustration of Definitions 



corresponding edges in P and R is identical. Therefore, the subprocedure
Pattern-Matching is reduced to the problem of finding all the subsets P of S each of
which is m-congruent to R, which is identical to the congruent region problem
with a minor change in relative position. Thus it can be computed in parallel with
the same time and space complexity as the parallel algorithm Congruent-Region
shown in the previous section. Since we have to iterate the subprocedure
Pattern-Matching n times for PSPM, the following theorem follows:

Theorem 3: Given point sets S with n points and R with k points in Euclidean
2-dimensional space, PSPM can be computed in $O(n^{1.5})$ time using $O(n)$ space
if the length of each edge for $r_i$ in R is unique; otherwise in $O(kn^{1.5})$ time on MCC
with n PEs using $O(kn)$ space, and in $O(n^{1.5})$ time for both cases on MCC with
kn PEs using $O(kn)$ space.
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5 Conclusion 

Our parallel algorithm for CRP on MCC has several research contributions:
First, as far as we know, it is the first parallel algorithm proposed on MCC,
which takes $O(\sqrt{n})$ time if all the edges in R have unique length; otherwise
$O(k\sqrt{n})$ time using n PEs, and $O(\sqrt{n})$ time for both cases using kn PEs, which
is optimal within a constant factor. Second, it can be exploited efficiently for
solving the more complex PSPM with a simple reduction to CRP. Note that the previous
parallel algorithms for solving CRP or PSPM on other parallel models such as
CREW and LCC cannot be directly implemented with the same time
complexity as ours on MCC even when using the same number of PEs as ours.
Investigating the congruence relation by checking the relative position between
edges as shown in Lemmas 1 and 2 enabled us to exploit RAR and RAW
operations for the efficient location of congruent regions and hence develop the
parallel algorithm with the proposed time complexity on MCC for CRP and
PSPM. Third, the parallel algorithm for CRP and PSPM can be generally
implemented on any parallel distributed memory model such as CCC (cube-connected
computer), TC (tree-connected computer), PSC (perfect shuffled computer), etc.,
without having to design a new parallel algorithm if they are provided with RAR
and RAW operations, since our parallel algorithm makes use of them as basic
operations. If it takes $O(T(n))$ time for RAR and RAW, CRP can be executed
in $O(T(n))$ time if all the edges in R have unique length; otherwise in $O(kT(n))$ time
using n PEs, and in $O(T(n))$ time for both cases using kn PEs. For example, T(n) is
$O(\log^2 n)$ for CCC, TC, and PSC in the distributed memory model. It still remains
an open problem to find an optimal parallel algorithm for PSPM on MCC
with n PEs for higher dimensions.
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Abstract. Collective communications compose a significant portion of 
the communications in many high performance applications. The
multidestination message passing mechanism has been proposed recently for
efficient implementation of collective operations by using fewer
communication phases. However, this mechanism has only been
employed for improving the performance of non-personalized collective
operations such as multicast. In this paper, we investigate whether
multidestination message passing can also help personalized communications
such as one-to-all scatter. We propose a new scheme, called the Sequential
Multidestination Tree-based (SMT) scheme, which takes advantage of the
multidestination message passing mechanism and provides better
performance. For a range of system sizes and parameters, it is shown that
the SMT scheme outperforms other known schemes for a wide range of
message lengths.



1 Introduction 

Collective communications such as broadcast, multicast, global reduction, scatter,
gather, complete exchange, and barrier synchronization have been shown to compose
a significant portion of the communications in many high performance
applications [4]. Furthermore, these operations are used by the underlying system for
operations such as resource management and maintenance of cache coherency
in Distributed Shared Memory systems [2]. Thus, it is crucial to develop
efficient mechanisms and algorithms for implementing these operations. Collective
operations can be classified into two major groups: non-personalized and
personalized operations. In non-personalized operations, the same message is exchanged
among the participating nodes. In personalized operations, different messages are
exchanged between the participating nodes.

A new multidestination message passing mechanism [6] has been proposed
recently for efficient implementation of collective operations by using fewer
communication phases. The multidestination message passing schemes
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OSU Presidential Fellowship, an NSF Career Award MIP-9502294, NSF Grant CCR- 
9704512, an Ameritech Faculty Fellowship award, and grants from the Ohio Board 
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have been used to implement many non-personalized collective communications 
such as multicast, broadcast, and barrier synchronization. However, the problem 
of improving the performance of personalized operations (such as scatter, gather 
and complete exchange) by using multidestination message passing mechanism 
has remained unsolved. 

In this paper, we take on such a challenge and demonstrate that the perfor- 
mance of personalized collective communications can also be improved by using 
the multidestination message passing mechanism. In particular, we focus on the 
one-to-all personalized communication simply known as scatter [4]. We focus on 
systems supporting wormhole switching. As a starting point, we consider only 
systems with k-ary n-cube regular topologies and router-based multidestination
message passing. However, the framework can be generalized to other systems 
with switch-based and network interface-based multidestination message pass- 
ing. 

The paper is organized as follows. The previously known schemes for scatter
communication are discussed in Section 2. In Section 3, we present a new scheme
(SMT) for implementing scatter. Performance evaluation results are presented
in Section 4. Finally, we conclude the paper with conclusions and future research
directions.

2 Current Approaches for Implementing Scatter 
Communication 



In this section, we define the scatter collective communication operation and de- 
rive its lower bound for wormhole-routed systems. Then, we present two unicast- 
based scatter schemes which are used on current systems. We also investigate 
the conditions under which these schemes can meet the lower bound. 



2.1 Scatter Communication and Its Lower Bound 



Scatter is one of the personalized communication operations in which one
processor (source node) sends a different message to every other processor in a
user-defined group [3]. For a system with P processors and message size of l
bytes, the volume of the data which has to be sent out from the source node
to the network is $(P - 1)l$, and at least one message should be sent out.
Therefore, the lower bound (LB) of the scatter operation latency for wormhole routed
systems (in the absence of congestion)¹ can be derived as:

$$T_{LB} = t_s + (P - 1)\,l\,t_p \qquad (1)$$

where $t_s$ is the communication start-up overhead and $t_p$ is the transmission time
per byte.

¹ We ignore the impact of node delay for simplicity.
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2.2 Sequential Tree (ST) Scheme 

The ST scheme is based on sequential trees. In this scheme the source node
sends a separate message to each of the other processors participating in scatter.
Therefore, the number of required communication steps is $P - 1$. Figure 1.a
shows the sequential tree for a system with 8 processors. The algorithm for
implementing the ST scheme can be found in [1].




Fig. 1. Sequential tree (ST) (a) and Binomial Tree (BT) (b) schemes for an 8-processor 
system. 



Let us analyze the latency of the scatter communication under the ST scheme.
Depending on the system parameters, two different cases can be recognized:
$t_s \le l\,t_p$ and $t_s > l\,t_p$. Thus, the scatter latency can be written as:

$$T_{ST} = \begin{cases} t_s + (P - 1)\,l\,t_p & \text{if } t_s \le l\,t_p \\ (P - 1)\,t_s + l\,t_p & \text{if } t_s > l\,t_p \end{cases} \qquad (2)$$

It is obvious that the lower bound can be met only if $t_s \le l\,t_p$. If this
condition is not satisfied the latency of the scatter operation will exceed the
lower bound. Because of the $(P - 1)t_s$ term, the latency of scatter can increase
rapidly when the system size increases or the start-up overhead is significant.
This leads to the following observations:

Observation 1 The ST scheme can achieve the lower bound only when $l \ge t_s / t_p$.

Observation 2 With $l < t_s / t_p$, the latency of scatter increases rapidly with
increase in system size.
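
The two observations can be checked numerically with a direct transcription of
the lower bound $t_s + (P-1)\,l\,t_p$ and the two-case ST latency. The function
names and the parameter values in the usage note are illustrative assumptions,
not measurements from any system.

```python
def scatter_lower_bound(P, l, t_s, t_p):
    """Lower bound on scatter latency: one start-up plus injecting (P - 1) * l bytes."""
    return t_s + (P - 1) * l * t_p


def st_latency(P, l, t_s, t_p):
    """Scatter latency under the sequential tree (ST) scheme."""
    if t_s <= l * t_p:
        # Start-ups are hidden behind the transmission of earlier messages.
        return t_s + (P - 1) * l * t_p
    # Start-up overhead dominates: P - 1 start-ups plus the last transmission.
    return (P - 1) * t_s + l * t_p
```

With the hypothetical values P = 8, $t_s$ = 10 and $t_p$ = 1, a message size of
l = 16 gives an ST latency equal to the lower bound (122), whereas l = 2 gives
72 against a lower bound of 24.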

2.3 Binomial Tree (BT) Scheme 

The BT scheme is based on the well-known binomial trees. Such a scheme is used
in the MPL library for scatter implementation. In this approach, the number of
processors which receive the message is doubled in each communication step.
Figure 1.b illustrates the BT scheme on an 8-processor system. The algorithm
for implementing the BT scheme is presented in [1].



Scatter Communication with Multidestination Message Passing 



207 



Let us analyze the latency of the scatter communication under the BT
scheme. Since the number of processors which receive their messages is
doubled in each step, the number of required communication steps for the entire
operation is $\log_2 P$. On the other hand, the data size decreases by a factor of
two in every successive step. Therefore, the latency of scatter can be written as:

$$T_{BT} = \log_2 P \; t_s + (P - 1)\,l\,t_p \qquad (3)$$

It can be seen that the BT scheme can never meet the lower bound due to
the additional number of start-ups. This leads to:

Observation 3 The BT scheme is not capable of achieving the lower bound.

3 Sequential Multidestination Tree (SMT) Scheme

In this section, we use multidestination message passing to develop better schemes
for scatter. (A brief overview of the multidestination mechanism is presented
in [1].) First, we provide the motivation behind such schemes. Then, we propose
a generalized scheme. Finally, we present schemes for 2D tori/meshes and k-ary
n-cubes.

3.1 Motivation

As presented in Observations 1 and 3, only the ST scheme can achieve the lower
bound. Nevertheless, Observation 1 shows that the ST scheme can achieve the
lower bound only for a limited range of system parameters and message
sizes. This lower bound is achieved only when the start-up overheads of sending
messages are overlapped with the transmission time of other messages. To achieve
this requirement, the $t_s \le l\,t_p$ condition should hold. In other words, the size of
the messages being transmitted should be large enough so that the next message
can become ready for transmission not later than when the previous message
has been completely injected into the interconnection network ($l \ge t_s / t_p$). In this
section, we propose a new scheme that extends the cases in which the lower
bound latency can be met by increasing the size of the messages which are
being transmitted. The multidestination message passing mechanism helps us to
achieve such an increase of message length.

3.2 The General Scheme

In the SMT scheme, all processors which are participating in the scatter
communication operation are grouped into smaller groups. The messages of all
processors in each group are combined to form a multidestination message. Then,
these different multidestination messages are sent out to the members of the
corresponding groups. Depending on the routing scheme being used in the
system and the arrangement of the data at the source node, different groupings can
be used. For example, as illustrated in Fig. 2, nodes in a 12-processor system
are grouped into four groups (G1-G4). With such grouping, the source node
uses four messages to cover these groups. If the smallest group of processors has
g (g > 1) members, then the size of the shortest message that is going to be
transmitted is gl. Therefore, the condition under which the lower bound is met
can be written as $gl > t_s / t_p$. The minimum value of g to achieve the lower bound
can be found from the following equation:

$$g_{min} = \left\lceil \frac{t_s}{l\, t_p} \right\rceil \qquad (4)$$
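
As a rough illustration of the condition $gl > t_s / t_p$, the sketch below computes a minimum group size for a given message length. The function name `min_group_size`, the use of integer nanoseconds, the assumption that $t_p$ is a per-byte cost, and the choice to round up with a ceiling (which treats equality as sufficient) are all assumptions made for illustration. The sample values $t_s$ = 5 μsec and $t_p$ = 10 nsec are the base system parameters used in the performance evaluation.

```python
def min_group_size(t_s_ns: int, t_p_ns: int, l_bytes: int) -> int:
    """Smallest integer g with g * l * t_p >= t_s (ceiling of t_s / (l * t_p))."""
    per_message_ns = l_bytes * t_p_ns  # transmission time of one l-byte message
    return -(-t_s_ns // per_message_ns)  # integer ceiling division


if __name__ == "__main__":
    # Base system: t_s = 5000 ns, t_p = 10 ns per byte (assumed unit).
    for l in (16, 64, 256):
        print(l, min_group_size(5000, 10, l))
```

Shorter messages need larger groups before the start-up cost can be hidden, which is the trade-off the grouping is meant to exploit.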



Obviously, for larger values of g there are fewer constraints on the system parameters
and message sizes under which the lower bound can be achieved. However, the need
for rearrangement of the original data and the routing constraints do not allow
arbitrary values of g to be selected. Assuming that all groups have g members, the
execution time of the SMT scatter scheme can be derived as:

$$T_{SMT} = \begin{cases} t_s + (P - 1)\, l\, t_p & \text{if } t_s < l\, g\, t_p \\ \ldots - 1)\, l\, t_p & \text{if } t_s > l\, g\, t_p \end{cases} \qquad (5)$$



Observation 4 The SMT scheme can achieve the lower bound when $l > \frac{t_s}{g\, t_p}$.




Fig. 2. Illustration of the Sequential Multidestination Tree (SMT) scheme. 

3.3 The SMT Scheme for 2D Tori, Meshes, and k-ary n-cubes

Using this generalized approach, we have shown how this scheme can be used
to implement scatter on 2D meshes/tori and higher dimensional systems with
dimension-ordered routing [1]. In a k x k torus, the processors of each column
can form a group. A multidestination message can
deliver the data for all members of each group under dimension-ordered routing.
The SMT scheme for 2D tori can be easily generalized to k-ary n-cubes. To
implement the SMT scheme on k-ary n-cubes with dimension-ordered routing,
it is enough to partition the processors along the lowest dimension and select g
equal to k.

4 Performance Evaluation

In this section, we compare the latency of the ST, BT, and SMT schemes for different
system parameters and message sizes by using a wormhole-routed simulator
called WORMulSim. WORMulSim is a detailed flit-level simulator which is
built using CSIM and takes care of flit-level contention. We consider an 8 x 8
torus with $t_s$ = 5.0 μsec and $t_p$ = 10.0 nsec as our base system and investigate
the effect of variations in system parameters on the latency of the scatter
schemes. We also present the factor of improvement obtained by using SMT
instead of ST and BT for systems with different characteristics.



4.1 Effect of Communication Start-up 

Figure 3 shows the comparison results for three different start-ups ($t_s$ = 1.0, 5.0,
and 10.0 μsec). It can be seen that the SMT scheme performs better than the
ST scheme for all message sizes. It can also be verified that the SMT scheme
performs better than BT for a wide range of message sizes. As $t_s$ decreases, the
minimum message length for which the lower bound can be achieved decreases
($l > t_s / (g\, t_p)$). With $t_s$ = 10.0 μsec, the SMT scheme achieves the lower bound for
messages of size 128 bytes and more. With $t_s$ = 5.0 and 1.0 μsec, the lower bound
is achievable for messages of size 64 and 16 bytes, respectively.
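
The message sizes quoted above can be checked against the condition $l > t_s / (g\, t_p)$ with a short sketch. It assumes g = 8 (one column of the 8 x 8 torus per group), that $t_p$ is a per-byte cost, and that message sizes are powers of two, as the reported sizes suggest. The function names are hypothetical.

```python
def lower_bound_threshold_bytes(t_s_ns: float, t_p_ns: float, g: int) -> float:
    """Message length l must exceed t_s / (g * t_p) for SMT to meet the lower bound."""
    return t_s_ns / (g * t_p_ns)


def smallest_power_of_two_above(x: float) -> int:
    """Smallest power-of-two message size strictly greater than x."""
    size = 1
    while size <= x:
        size *= 2
    return size


if __name__ == "__main__":
    g = 8  # assumed: one column of the 8 x 8 torus forms a group
    for t_s_us in (10.0, 5.0, 1.0):
        threshold = lower_bound_threshold_bytes(t_s_us * 1000.0, 10.0, g)
        print(t_s_us, threshold, smallest_power_of_two_above(threshold))
```

Under these assumptions the thresholds are 125, 62.5, and 12.5 bytes, which round up to the 128-, 64-, and 16-byte sizes reported for the three start-up times.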

Figure 4 shows the factor of improvement of the SMT scheme over the ST
scheme for three different start-ups. It can be observed that the ST scheme can
never perform better than the SMT scheme. For a 1-byte message size, the SMT
scheme can reduce the scatter latency by up to a factor of 8 compared to the ST
scheme. As $t_s$ increases, the SMT scheme performs increasingly better than the ST
scheme for a given message length. Similarly, Fig. 5 shows the factor of improvement of
the SMT scheme over the BT scheme. It can be observed that the BT scheme can
perform better than the SMT scheme only for very short messages. For other message
sizes, the SMT scheme can show up to 60% improvement compared to the BT
scheme. It can be seen that as $t_s$ decreases, the maximum message size for which
the BT scheme can perform better than the SMT scheme decreases.

As $t_s$ keeps decreasing in new generation systems, these results indicate
that the SMT scheme can outperform the unicast-based schemes (ST and BT)
for a wider range of system parameters and message sizes.






Fig. 3. Latency of three different schemes for different start-up times.

Fig. 4. Factor of improvement ($T_{ST}/T_{SMT}$) for different start-up times.

Fig. 5. Factor of improvement ($T_{BT}/T_{SMT}$) for different start-up times.




Fig. 6. Latency of three different schemes for different system sizes. 



4.2 Effect of System Size 

The effect of system size on the scatter schemes is shown in Fig. 6. It is interesting
to observe that by increasing the system size the minimum length requirement
for achieving the lower bound in the SMT scheme decreases. It can be seen that
while the SMT scheme cannot achieve the lower bound for messages shorter than
128 bytes on a 4 x 4 system, the lower bound is achieved even for 64-byte and
16-byte messages on 8 x 8 and 16 x 16 systems, respectively. This performance
improvement demonstrates the scalability of the SMT scheme with respect to
increases in system size.

4.3 Effect of Propagation Cost 

Figure 7 shows the latency of the three scatter schemes for different propagation
times ($t_p$ = 5.0, 10.0, and 20.0 nsec). It can be observed that as propagation time
increases, the performance of the ST scheme becomes worse. On the other hand,
it can be seen that as $t_p$ decreases, the factor of improvement obtained by using
the SMT scheme over the BT scheme increases. The factor of improvement for a
message size of 128 bytes and $t_p$ = 20.0 nsec is 1.15. For a similar system with $t_p$ = 10.0
and 5.0 nsec, the factors of improvement are 1.29 and 1.55, respectively.

Fig. 7. Latency of three different schemes for different propagation times.

The minimum message size above which the SMT scheme outperforms the
BT scheme for a large set of system parameters is presented in [1]. It is shown
that for a wide range of system parameters, system sizes, and message sizes the
SMT scheme performs better than the BT scheme.

5 Conclusions 

In this paper, we have presented a new approach to implementing scatter
communication in k-ary n-cube wormhole systems using the multidestination message
passing mechanism. Compared to current unicast-based schemes, the proposed
framework can meet the lower bound latency for a much wider range of
system, technological, and application parameters. We are extending our work
to irregular and tree-based topologies and to switch-based and network
interface-based multidestination message passing mechanisms.

Power: A First Class Design
Constraint for Future
Architectures

Abstract. In many mobile and embedded environments power is already
the leading design constraint. This paper argues that power will also be a
limiting factor in general purpose high-performance computers. It should
therefore be considered a “first class” design constraint on a par with
performance. A corollary of this view is that the impact of architectural design
decisions on power consumption must be considered early in the design cycle,
at the same time that their performance impact is considered. In this paper
we summarize the key equations governing power and performance, and use
them to illustrate some simple architectural ideas for power savings. The
paper then presents two contrasting research directions where power is
important. We conclude with a discussion of the tools needed to conduct research
into architecture-power trade-offs.



1 Introduction 

The limits imposed by power consumption are becoming an issue in most areas of
computing. The need to limit power consumption is readily apparent in the case of portable
and mobile computer platforms, the laptop and the cell phone being the most
common examples. But the need to limit power in other computer settings is becoming
important too. A good example is the case of server farms. They are the warehouse-sized
buildings that internet service providers fill with servers. A recent analysis presented in
[1] has shown that a 25,000 sq. ft. server farm with about 8,000 servers will consume
2MW. Further, it was shown that about 25% of the total running cost of such a facility
is attributable to power consumption, either directly or indirectly.

It is frequently reported that the net is “growing exponentially.” It follows then that
the server farms will match this growth and with them the demand for power. A recent
article in the Financial Times noted that 8% of power consumption in the US is for
information technology (IT). If this component is set to grow exponentially without check,
it will not be long before the power for IT will be greater than for all other uses
combined.

To get an idea of the trends in power consumption of today’s processors consider
the following table, taken from [2]. The rapid growth in power consumption is obvious.

Table 1: Power Trends for the Compaq Alpha

Alpha model   Power (W)   Frequency (MHz)   Die size (mm^2)   Supply voltage (V)
21064         30          200               234               3.3
21164         50          300               299               3.3
21264         90          575               314               2.2
21364         >100        >1000             340               1.5



The growth in power density of the die is equally alarming. It is growing linearly, so
that the power density of the 21364 has reached about 30 W/cm^2, three times that of
a typical hot plate. This growth has occurred in spite of process and circuit
improvements. Clearly, trading high power for high performance cannot continue, and
architectural improvements will also have to be added to process and circuit improvements if
the growth in power is to be contained. The only exceptions will be one-of-a-kind
supercomputers built for special tasks like weather modelling.
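
The power density figure can be reproduced from Table 1 by dividing power by die area. The sketch below does this for all four chips. The function name is hypothetical, and the 21364 entry uses 100 W as a lower bound because the table only gives ">100".

```python
# (model, power in W, die size in mm^2) taken from Table 1.
# For the 21364 the table gives ">100 W"; 100 W is used here as a lower bound.
ALPHA_CHIPS = [
    ("21064", 30.0, 234.0),
    ("21164", 50.0, 299.0),
    ("21264", 90.0, 314.0),
    ("21364", 100.0, 340.0),
]


def power_density_w_per_cm2(power_w: float, die_mm2: float) -> float:
    """Average power density over the die, in W/cm^2 (1 cm^2 = 100 mm^2)."""
    return power_w / (die_mm2 / 100.0)


if __name__ == "__main__":
    for model, power_w, die_mm2 in ALPHA_CHIPS:
        print(f"{model}: {power_density_w_per_cm2(power_w, die_mm2):.1f} W/cm^2")
```

With these inputs the 21364 comes out at a little over 29 W/cm^2, consistent with the "about 30 W/cm^2" quoted above.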

In this introductory discussion we have briefly illustrated that power will be a lim- 
itation for most types of computers, not just ones that are battery powered. Furthermore, 
we have argued that system and architectural improvements will also have to be added 
to process and circuit improvements to contain the growth in power. It therefore seems 
reasonable that power should be dealt with at the same stage in the design process as
architectural trade-offs.

In the remainder of the paper we will expand on this theme starting, in the next sec- 
tion, with a simple model of power consumption for CMOS logic. 

2 Power Equations for CMOS Logic 

There are three equations that provide a model of the power-performance trade-offs for
CMOS logic circuits. The equations are frequently discussed in the low-power literature
and are simplifications that capture the essentials for logic designers, architects, and
systems builders. We focus on CMOS because it will likely remain the dominant
technology for the next 5-7 years. In addition to applying directly to processor logic and
caches, the equations are also relevant for some aspects of DRAM chips. The first
equation defines power consumption:

$$P = A C V^2 f + \tau A V I_{short} f + V I_{leak} \qquad (1)$$

There are three components to this equation. The first is perhaps the most familiar. It
measures the dynamic power consumption caused by the charging and discharging of
the capacitive load on the output of each gate. It is proportional to the frequency of
operation of the system, f, the activity of the gates in the system, A (some gates may not
switch every clock), the total capacitance seen by the gate outputs, C, and the square of
the supply voltage, V. The second term captures the power expended due to the
short-circuit current, $I_{short}$, that momentarily, for a time $\tau$, flows between the supply voltage and
ground when the output of a CMOS logic gate switches. The third term measures the
power lost due to leakage current that is present regardless of the state of the gate.

In today's circuits the first term dominates, immediately suggesting that the most
effective way to reduce power consumption is to reduce the supply voltage, V. In fact,
the quadratic dependence on V means that the savings can be significant: halving the
voltage will reduce the power consumption to one quarter of its original value.
Unfortunately, this comes at the expense of performance, or more accurately maximum
operating frequency, as the next equation shows:

$$f_{max} \propto (V - V_{threshold})^2 / V \qquad (2)$$

In other words, the maximum frequency of operation is roughly linear in V. Reducing
V will limit the circuit to a lower frequency. But reducing the power to a quarter of its
original value will only cut the maximum frequency in half. There is an important
corollary to equations (1) and (2) that has been widely noticed: if a computation can be split
in two and run as two parallel independent tasks, this form of parallel processing has the
potential to cut the power in half without slowing the computation.
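
A minimal numerical sketch of the two relations may help. It evaluates only the dynamic term of equation (1) and the form of equation (2) up to a constant of proportionality. The function names, the constant `k`, and the simplification $V_{threshold} = 0$ in the frequency comparison are assumptions made for illustration; all quantities are in arbitrary units.

```python
def dynamic_power(a: float, c: float, v: float, f: float) -> float:
    """First term of equation (1): A * C * V^2 * f."""
    return a * c * v * v * f


def f_max(v: float, v_threshold: float, k: float = 1.0) -> float:
    """Equation (2) up to an assumed constant k: (V - V_threshold)^2 / V."""
    return k * (v - v_threshold) ** 2 / v


if __name__ == "__main__":
    full = dynamic_power(a=1.0, c=1.0, v=2.0, f=1.0)
    half = dynamic_power(a=1.0, c=1.0, v=1.0, f=1.0)
    print(half / full)                         # power ratio after halving V
    print(f_max(1.0, 0.0) / f_max(2.0, 0.0))   # frequency ratio, V_threshold = 0
```

With the threshold neglected, halving V cuts the dynamic term to a quarter while the maximum frequency only halves. A non-zero threshold makes the frequency penalty larger than this idealized figure.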

We can lessen the effect of reducing V in (2) by reducing $V_{threshold}$. In fact this must
occur to allow proper operation of low voltage logic circuits. Unfortunately, reducing
$V_{threshold}$ increases the leakage current, as the third equation shows:

$$I_{leak} \propto \exp(-V_{threshold} / (35\,\mathrm{mV})) \qquad (3)$$

Thus, this is a limited option for countering the effect of reducing V. It makes the
leakage term in the first equation appreciable.
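
To give a feel for how steep this exponential is, the sketch below computes the factor by which leakage grows when the threshold voltage is lowered by a given amount. It assumes the exponential form with a 35 mV scale and nothing else; the function name and the sample reductions are hypothetical.

```python
import math

SCALE_MV = 35.0  # the 35 mV constant in equation (3)


def leakage_ratio(delta_v_threshold_mv: float) -> float:
    """Factor by which I_leak grows when V_threshold is lowered by the given amount."""
    return math.exp(delta_v_threshold_mv / SCALE_MV)


if __name__ == "__main__":
    for delta_mv in (35.0, 70.0, 100.0):
        print(delta_mv, round(leakage_ratio(delta_mv), 2))
```

Each 35 mV reduction multiplies the leakage current by e, so a reduction of roughly 80 mV already costs about an order of magnitude.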

Summary: There are three important points that can be taken away from the above
model. They are:

1. Reducing voltage has a significant effect on power consumption: $P \propto V^2$.

2. Reducing activity can too; in the simplest case this means turning off parts of
the computer that are not being used.

3. Parallel processing is good if it can be done efficiently; independent tasks are
ideal.

2.1 Other Figures of Merit Related to Power 

The equation for power consumption given in (1) is an average value. There are two
important cases where more information is needed. The first is peak power. Typically,
systems have an upper limit, which if exceeded will lead to some form of damage;
average power does not account for this. The second is dynamic power. Sharp changes in
power consumption can produce inductive effects that lead to circuit malfunction.
The effect is seen in “di/dt noise”; again, average power does not account for this.

The term “power” is often used quite loosely to refer to quantities that are not really
power. For example, in the case of portable devices the amount of energy used to
perform a computation may be a more useful measure, because a battery stores a quantity
of energy, not a quantity of power. To refer to a processor as being “lower power” than
another may be misleading if the computation takes longer to perform. The total energy
expended may be the same in both cases, so the battery will be run down by the same
amount in both cases¹. This leads to the idea of energy/operation. A processor with a
low energy/operation is sometimes incorrectly referred to as “low power.” In fact, the
inverse of this measure, MIPS/W, is frequently used as a figure of merit for processors
intended for mobile applications [3].

Although MIPS/W is widely used as a figure of merit (higher numbers are better),
it too can be misleading because cutting the power consumption in half reduces the
frequency of operation by much less, because of the quadratic term in (1), as we have
discussed. This has led Gonzalez and Horowitz to propose a third figure of merit:
energy*delay. This measure takes into account that, in systems whose power is modelled
by (1), it is possible to trade a decrease in speed for higher MIPS/W. Unfortunately, the
bulk of the literature uses MIPS/W or simply Watts, so we will continue this convention,
recognizing that occasionally it may suggest misleading trade-offs where “quadratic”
devices like CMOS are concerned. Finally, it should be noted that if the computation
under consideration must finish by a deadline, slowing the operation may not be an
option. In these cases a measure that combines total energy with a deadline is more
appropriate.
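
The difference between the two figures of merit can be seen in a small sketch. It assumes an idealized device in which only the dynamic term of (1) matters, frequency scales linearly with voltage (threshold neglected), and one operation completes per clock cycle. The function name and the constants `a`, `c`, and `k` are hypothetical, and the units are arbitrary.

```python
def figures_of_merit(v: float, a: float = 1.0, c: float = 1.0, k: float = 1.0):
    """Return (MIPS/W, energy*delay) in arbitrary units for supply voltage v.

    Assumes f = k * v, P = a * c * v^2 * f, and one operation per clock cycle.
    """
    f = k * v
    power = a * c * v * v * f
    mips_per_watt = f / power        # operations per second per watt
    energy_per_op = power / f        # joules per operation
    delay_per_op = 1.0 / f           # seconds per operation
    return mips_per_watt, energy_per_op * delay_per_op


if __name__ == "__main__":
    for v in (1.0, 0.5):
        print(v, figures_of_merit(v))
```

Under these assumptions, halving the voltage quadruples MIPS/W while energy*delay improves only by a factor of two, so energy*delay gives less credit to a design that is merely run more slowly.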

3 Techniques for Reducing Power Consumption 

In this section we will survey some of the most common techniques that systems de- 
signers have proposed to save power. The scope of this brief overview includes logic, 
architecture and operating systems. 

3.1 Logic 

There are a number of techniques at the logic level for saving power. The clock tree 
can consume 30% of the power of a processor — the early Alpha 21064 exceeded this. 
Therefore, it is not surprising that this is an item where a number of power saving tech- 
niques have been developed. 

3.1.1 Clock Gating 

This is a technique that has been widely employed. The idea is to turn off those parts of 
the clock tree to latches or flip-flops that are not being used. Until a few years ago gated 
clocks were considered poor design practice, because the gates in the clock tree can ex- 
acerbate clock skew. However, more accurate timing analyzers and more flexible de- 
sign tools have made it possible to produce reliable designs with gated clocks, and this 
technique is no longer frowned upon. 



1. This is a simplification because the total energy that can be drawn from a
battery after it has been charged depends, to some extent, on the rate at which the
energy is drawn out. We will ignore this effect for our simplified analysis.

3.1.2 Half-Frequency/Half-Swing Clocks

The idea with the half-frequency clock is to use both edges of the clock to synchronize
events. The clock can then run at half the frequency of a conventional clock. The
drawbacks are that the latches are much more complex and occupy more area, and the
requirements on the clock are more stringent.

The half-swing clock also increases the requirements on latch design, and it is dif- 
ficult to employ in systems where V is low in the first place. However, the gains from 
lowering clock swing are usually greater than for clocking on both edges. 

3.1.3 Asynchronous Logic 

The proponents of asynchronous logic have pointed out that their systems do not have
a clock and therefore stand to save the considerable power that goes into the clock tree.
However, there are drawbacks with asynchronous logic design. The most notable is
the need to generate completion signals. This means that additional logic must be
employed at each register transfer; in some cases a double rail implementation is
employed. Other drawbacks include difficulty of testing and absence of design tools.

There have been several projects to demonstrate the power saving of asynchronous
systems. The Amulet, an asynchronous implementation of the ARM instruction set
architecture, is one of the most successful [5]. It is difficult to draw definitive conclusions
because it is important to compare designs that are realized in the same technologies.
Furthermore, the asynchronous designer is at a disadvantage, because, as noted, today’s
design tools are geared for synchronous design. In any case, asynchronous design does
not appear to offer sufficient advantages for there to be a wholesale switch to it from
synchronous designs.

The area where asynchronous techniques are likely to prove important is in globally 
asynchronous, locally synchronous systems (GALS). These reduce clock power and 
help with the growing problem of clock skew across large chips, while still allowing 
conventional design techniques for most of the chip. 

3.2 Architecture 

The focus of computer architecture research, typified by the work presented at the
International Symposia on Computer Architecture and the International Symposia on
Microarchitecture, has been on high performance. There have been two important themes
pursued by this research. One has been to exploit parallelism, which we have seen can help
reduce power. The other is to employ speculation: computations are allowed to
proceed beyond dependent instructions that may not have completed. Clearly, if the
speculation is wrong, energy has been wasted executing useless instructions. Branch
prediction is perhaps the best known example of speculation. If there is a high degree of
confidence that the speculation will be correct, then it can provide an increase in the
MIPS/W figure. However, for this to be the case, the confidence level must often be so high
that speculation is rarely employed as a means to improve MIPS/W or reduce power [7].

The area where new architectural ideas can most profitably contribute to reducing
power is in reducing the dynamic power consumption term, specifically the activity
factor, A, in (1).

3.2.1 Memory Systems

The memory system is a significant source of power consumption. For systems with
relatively unsophisticated processors, cache memory can dominate the chip area. There
are two sources of power loss in memory systems. First there is the dynamic power loss,
due to the frequency of memory access. This is modelled by the first term in (1). The
second is the leakage current, the third term in (1).

There have been several proposals for limiting the dynamic power loss in memories
by organizing memory so that only parts of it are activated on a memory access. Two
examples are the filter cache and memory banking [8]. The filter cache is a small cache
placed in front of the L1 cache. Its purpose is to intercept signals intended for the main
cache. Its hit rate does not have to be very high, and its access time need be no faster
than the L1 cache. Even if it is hit only 50% of the time, the power saved is half
the difference between activating the main cache and the filter cache. This can be
significant. The second example, memory banking, is currently employed in some low
power designs. The idea is to split the memory into banks and activate only the bank
presently in use. It relies on the reference pattern having a lot of spatial locality, and thus
is more suitable for instruction cache organization.
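
The filter cache argument can be written as a one-line estimate. The sketch below follows the accounting used in the text, where each filter-cache hit replaces a main-cache activation with a filter-cache activation. The function name and the energy values are hypothetical, and a more detailed model would also charge the filter-cache lookup on misses.

```python
def filter_cache_saving(hit_rate: float, e_main: float, e_filter: float) -> float:
    """Energy saved per access: hit_rate * (main-cache cost - filter-cache cost)."""
    return hit_rate * (e_main - e_filter)


if __name__ == "__main__":
    # Hypothetical per-access energies in arbitrary units.
    e_main, e_filter = 1.0, 0.25
    for hit_rate in (0.5, 0.8):
        print(hit_rate, filter_cache_saving(hit_rate, e_main, e_filter))
```

With a 50% hit rate the saving is half the difference between the two activation costs, as stated above.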

There is not much that can be done by the architect or systems designer to limit leak- 
age, except to shut the memory down. This is only practical if the memory is going to 
be unused for a relatively long time because it will lose state and therefore must be 
backed up to disk. This type of shut down (often referred to as sleep mode) is usually 
handled by the operating system. 

3.2.2 Buses 

Buses are a significant source of power loss. This is especially true for inter-chip buses.
These are often very wide: the standard PC memory bus includes 64 data lines and 32
address lines. Each requires substantial drivers. It is not unusual for a chip to expend
15-20% of its power on these inter-chip drivers.

There have been several proposals for limiting this power loss. One idea is to encode
the address lines into a Gray code. The reasoning is that address changes (particularly
from cache refills) are often sequential, and counting in Gray code switches the least
number of signals [10].
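
The benefit of Gray coding for sequential addresses can be checked by counting bit transitions. The sketch below uses the standard binary-reflected Gray code and a hypothetical burst of 16 consecutive addresses; the helper names are illustrative only.

```python
def to_gray(n: int) -> int:
    """Standard binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def bit_transitions(values) -> int:
    """Total number of bit positions that change between consecutive values."""
    values = list(values)
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    addresses = range(16)  # a sequential burst of 16 addresses (assumed)
    print(bit_transitions(addresses))                       # plain binary
    print(bit_transitions(to_gray(a) for a in addresses))   # Gray coded
```

For this burst, plain binary counting toggles 26 address lines in total, while the Gray-coded sequence toggles exactly one line per step, 15 in total.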

It is straightforward to adapt other ideas to this problem. Transmitting the difference
between successive address values achieves a similar result to the Gray code. More
generally, it has been observed that the address lines can be reduced by compressing the
information in them [9]. These techniques are best suited to inter-chip signalling,
because the encoding can be integrated into the memory controllers.

Continuing with the code compression concept, it has been shown that significant
instruction memory savings result if the program is stored in compressed form and
decompressed on the fly (typically on a cache miss) [11]. The reduction in memory size
can translate to power savings. It also reduces the frequency of code overlays, another
source of power loss, and a technique still used in many digital signal processing (DSP)
systems.



3.2.3 Parallel Processing and Pipelining 

As we noted above, a corollary of our power model is that parallel processing can be an
important technique for reducing power consumption in CMOS systems. Pipelining
does not share this advantage, because the concurrency in pipelining is achieved
through increasing the clock frequency, which limits the ability to scale the voltage (2).
This is an interesting reversal, because the microarchitecture for pipelining is simpler
than for parallel instruction issue and therefore it has traditionally been the more
common of the two techniques employed to speed up execution.

The degree to which computations can be parallelized varies widely. Some are
“embarrassingly parallel.” They are usually characterized by identical operations on array
data structures. However, for general purpose computations typified by the SPEC
benchmark suite there has been little progress on discovering parallelism. This is
reflected in the fact that successful general purpose microprocessors rarely issue more
than three or four instructions at once. Increasing instruction level parallelism is not
likely to offset the loss due to hazards inhibiting efficient parallel execution. However,
it is likely that future desktop architectures will have shorter pipes.

In contrast, common signal processing algorithms often possess a significant degree 
of parallelism. This is reflected in the architecture of DSP chips, which is notably dif- 
ferent from desktop or workstation architectures. DSPs typically run at much lower fre- 
quencies and exploit a much higher degree of parallelism. Parallelism and direct sup- 
port for a multiply-accumulate (MAC) operation, which occurs with considerable fre- 
quency in signal processing algorithms, means that in spite of their lower clock rates 
they can achieve high MIPS ratings. An example is given by Analog Devices 21160 
SHARC DSP. It can achieve 600 Mflops using only 2W on some DSP kernels. 

3.3 Operating System 

The quadratic voltage term in (1) means that reducing voltage has great benefit for
power savings, as we have noted several times. A processor does not have to run at its
maximum frequency all the time to get its work done. If the deadline of a computation is
known, it may be possible to adjust the frequency of the processor and reduce the supply
voltage. For example, a simple MPEG decode runs at a fixed rate determined by the
screen refresh (usually once every 1/30th of a second). A processor responsible for this
work could be adjusted to run so that it does not finish ahead of schedule and waste
power.
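
A rough sketch of the deadline argument follows. It picks the lowest frequency that still meets the deadline and estimates the energy relative to running at full speed. It assumes the supply voltage can be scaled linearly with frequency (threshold neglected), so energy per cycle scales with the square of the relative voltage. The function name, the cycle count per frame, and the 600 MHz maximum are hypothetical values chosen for illustration.

```python
def scaled_operating_point(cycles: float, deadline_s: float, f_max_hz: float):
    """Return (frequency, relative voltage, relative energy) that just meets the deadline.

    Assumes voltage scales linearly with frequency, so the energy for a fixed
    number of cycles scales with the square of the relative voltage.
    """
    f_needed = cycles / deadline_s
    if f_needed > f_max_hz:
        raise ValueError("deadline cannot be met even at maximum frequency")
    rel_v = f_needed / f_max_hz
    rel_energy = rel_v ** 2
    return f_needed, rel_v, rel_energy


if __name__ == "__main__":
    # Hypothetical decoder: 10 million cycles per frame, one frame every 1/30 s.
    f, v, e = scaled_operating_point(10e6, 1.0 / 30.0, 600e6)
    print(f"{f / 1e6:.0f} MHz, relative voltage {v:.2f}, relative energy {e:.2f}")
```

In this example the work fits in half the maximum frequency, and under the stated assumptions the energy per frame drops to about a quarter of the full-speed figure.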

It is very difficult to automatically detect periods where voltage can be scaled back,
so current proposals are to provide an interface to the operating system that allows it to
control the voltage. The idea is for the application to use these operating system
functions to “schedule” its voltage needs [12]. This is a very effective way to save power
because it works directly on the quadratic term in equation (1). Support for voltage scaling
has already found its way into the next generation of StrongARM microprocessor from
Intel, the XScale.

4 What Can We Do with a High MIPS/W Device?

The obvious applications for processors with a high MIPS/W are in mobile computing.
The so-called 3G mobile phones that we can expect in the near future will have to
possess remarkable processing capabilities. The 3G phones will communicate over a
packet-switched wireless link at up to 2 Mb/s. The link will support both voice and data and
will always be connected within areas that have 3G service. There are plans to
support MPEG4 video transmission as well as other data intensive applications.

Today's cell phones are built around two processors: a general purpose computer
and a DSP engine. Both have to be low power, the lower the better. A common solution
is to use an ARM processor for the general purpose machine and a Texas Instruments
DSP chip. For 3G systems both of these processors will have to be much more powerful
without sacrificing battery life. In fact, the processing requirements, given the power
constraints, are beyond the present state-of-the-art. The design of such systems where
power is a first class design constraint will be one of the next major challenges for
computer architects. Furthermore, the immense number of units that will be sold (there
are hundreds of millions of cell phones in use today) means that this platform will take
over from the desktop as the defining application environment for computers as a
whole.

The two-processor configuration of the cell platform has arisen out of the need to 
have a low power system that can perform significant amounts of signal processing and 
possess general purpose functionality for low-resolution display support, simple data 
base functions, and the protocols associated with cell-to-phone communications. From 
an architectural point of view this is not a particularly elegant solution and a “conver- 
gent” architecture that can handle the requirements of both signal processing and gen- 
eral purpose computing may be a cleaner solution. However, from a power perspective 
it may be easier to manage the power to separate components; either one can be easily 
turned off when not required. There are many more trade-offs to be studied. 

While the cell phone and its derivatives will become the leading user of power ef- 
ficient systems, it is by no means the only place where power is key, as we saw in the 
introduction. To go back to the server farm example, consider one of the servers: It has 
a workload of independent programs. Thus parallelism can be used without the ineffi- 
ciencies often introduced by the need for intra-program communication and synchroni- 
zation — multiprocessors are an attractive solution. A typical front-end servers that 
handles mail, web pages, and news, has a Intel compatible processor, 32M bytes of 
memory, an 8G byte disk and requires about 30W of power. Assume the processor is an 
AMD Mobile K6 with a total of 64K bytes of cache running at 400MHz. It is rated at 
12W (typical). Compare this to the recently announced Intel XScale, which is Intel’s 
next generation of StrongARM processor. It has the same total cache size but consumes 
only 450mW at 600MHz. (It can run from about 1GHz to below 100MHz; at 150Mhz 
is consumes just 40mW.) If we replace the K6 with 24 XScales we have not increased 
power consumption. For the K6 to process as many jobs as the 24-headed multiproces- 
sor, it will have to have an architectural efficiency (e.g., SPECmarks) that is about 24 
times that of an XScale. 
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There is a lot to disagree with in the above analysis. For example, the much more 
complex processor-memory interconnect required by the multiprocessor is not account- 
ed for, nor has any consideration been given to the fact that the individual jobs may have 
an unacceptable response time. However, the point is to show that if power is the chief 
design constraint, then a low power but non-trivial processor like the XScale can intro- 
duce a new perspective into computer architecture. Consider the above analysis if we 
replaced the K6 with a 100W Compaq 21364: it would need to be 200 times as efficient. 
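
The arithmetic behind this comparison can be checked with a short Python sketch. It uses only the power figures quoted above; the function name is illustrative, and the calculation ignores interconnect and memory power, as already noted.

```python
def efficiency_ratio(big_watts: float, small_watts: float) -> float:
    """How many small processors fit in the big processor's power budget.

    Equivalently, the factor by which the big processor must outperform
    one small processor to match the multiprocessor on independent jobs.
    """
    return big_watts / small_watts


XSCALE_W = 0.450   # Intel XScale at 600MHz, from the text
K6_W = 12.0        # AMD Mobile K6 at 400MHz (typical), from the text
ALPHA_W = 100.0    # Compaq 21364, from the text

k6_ratio = efficiency_ratio(K6_W, XSCALE_W)        # roughly 26.7
alpha_ratio = efficiency_ratio(ALPHA_W, XSCALE_W)  # roughly 222

# 24 XScales draw 24 * 0.45W, which stays under the K6's 12W rating.
multiprocessor_watts = 24 * XSCALE_W
print(f"{k6_ratio:.1f} {alpha_ratio:.0f} {multiprocessor_watts:.1f}")
```

Both ratios come out slightly above the 24 and 200 used in the text, which suggests those figures are rounded down.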

5 Conclusion 

We have made an argument for power to be a first class design constraint. We have also 
listed a number of ways that systems designers can contribute to satisfying this con- 
straint. There is much more research to be done. Power aware design is no longer ex- 
clusively the province of the process engineer and circuit designer, although their con- 
tributions are crucial. By one account (see the talk by Deo Singh in [1]) we need archi- 
tects and systems level designers to contribute a 2x improvement in power consumption 
per generation. This improvement is required in addition to the gains from process and 
circuit improvements. 

To elevate power to a first class constraint it needs to be dealt with early on in the 
design flow, at the same point that architectural trade-offs are being made. This is the 
point where cycle accurate simulation is performed. This is problematic because accu- 
rate power determination can only be made after chip layout has been performed. How- 
ever, very approximate values are usually acceptable early on in the design flow pro- 
vided they accurately reflect trends. In other words, if a change is made in the architec- 
ture the approximate power figure should reflect a change in power that is in the correct 
direction. 

Several research efforts are under way to insert power estimators into cycle level 
simulators. They typically employ event counters to obtain frequency measures for 
components of the architecture. These components are items such as adders, caches, de- 
coders, and buses, for which an approximate model of power can be obtained. An early 
example was developed by researchers at Intel [13]. Others are Wattch [14] and Sim- 
plePower [15]. All three are based on the SimpleScalar simulator that is widely used in 
academe [16]. A fourth effort, PowerAnalyzer, under development by the author, T.
Austin, and D. Grunwald, is expanding on the work in [13]. PowerAnalyzer will also
provide estimates for di/dt noise and peak power [17].
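
As a rough illustration of the event-counter approach, the following Python sketch multiplies per-component access counts by an assumed energy per access and divides by the simulated time. The component energies, event counts, and function name are hypothetical placeholders; they are not taken from Wattch, SimplePower, or PowerAnalyzer.

```python
from typing import Dict


def estimate_average_power(
    event_counts: Dict[str, int],
    energy_per_event_nj: Dict[str, float],
    cycles: int,
    clock_hz: float,
) -> float:
    """Return average power in watts from activity counts.

    energy = sum(count * energy_per_event); power = energy / elapsed time.
    """
    total_nj = sum(
        count * energy_per_event_nj[name]
        for name, count in event_counts.items()
    )
    elapsed_s = cycles / clock_hz
    return (total_nj * 1e-9) / elapsed_s


# Hypothetical per-access energies in nanojoules; real values would come
# from circuit-level models of each component.
ENERGY_NJ = {"adder": 0.1, "cache": 0.5, "decoder": 0.05, "bus": 0.2}

# Hypothetical counts collected during one cycle-level simulation run.
baseline = {"adder": 400_000, "cache": 300_000, "decoder": 1_000_000, "bus": 150_000}
fewer_cache_accesses = dict(baseline, cache=200_000)

p_base = estimate_average_power(baseline, ENERGY_NJ, cycles=1_000_000, clock_hz=400e6)
p_new = estimate_average_power(fewer_cache_accesses, ENERGY_NJ, cycles=1_000_000, clock_hz=400e6)
print(p_new < p_base)  # the estimate moves in the expected direction
```

Even with crude energy values, an architectural change that removes cache accesses lowers the estimate, which is the trend-tracking property asked for above.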

Abstract. With the advent of system level integration (SLI) and system-on-
chip (SOC), the center of gravity of the computer industry is moving from per- 
sonal computing into embedded computing. The opportunities, needs and con- 
straints of this next generation of computing are somewhat different from those 
to which we have got accustomed in general-purpose computing. This will lead 
to significantly different computer architectures, at both the system and the 
processor levels, and a rich diversity of off-the-shelf and custom designs. Fur- 
thermore, we predict that embedded computing will introduce a new theme into 
computer architecture: automation of architecture. In this paper, we elaborate 
on these claims and provide, as an example, an overview of PICO, the archi- 
tecture synthesis system that the authors and their colleagues have been devel- 
oping over the past five years. 



1 Introduction 

Over the past few decades, driven by ever increasing levels of semiconductor inte- 
gration, the center of gravity of the computer industry has steadily moved down from 
the mainframe price bracket to the personal computer price bracket. Now, with the 
advent of system level integration (SLI) and system-on-chip (SOC), the center of 
gravity is moving into embedded computing. 

Embedded computers hide within products designed for everyday use as they assist 
our world in an invisible manner. We refer to such products as smart products. Em- 
bedded computers have been incorporated into a broad variety of smart products such 
as video games, digital cameras, personal stereos, televisions, cellular phones, and 
network routers. A digital camcorder, for instance, uses a high-performance embedded
processor to record or play back a digital video stream of data. These embedded
processors achieve supercomputer levels of performance on highly specific tasks 
needed for recording and playback. It is the availability of high-density VLSI inte- 
gration that makes it practical to provide such product-defining performance at an 
affordable cost. 

As VLSI density increases, embedded processors continue to provide more compute
power at even lower cost. This is stimulating the rapid introduction of a vast
array of innovative smart products. Newly defined digital solutions, capable of
inexpensively performing complex data manipulations, provide revolutionary
improvements to product functionality. Embedded processors are often used in products that
previously relied on analog circuitry. When analog signals are digitally represented,
digital processing performance increases can be used to provide higher speed, higher
accuracy signal representation, more efficient storage, and more sophisticated
processing uniquely available in the digital domain.



1.1 Product Requirements 

Demanding product requirements often constrain the design of embedded systems 
from many sides. A smart product may simultaneously need higher performance,
lower cost, lower power, and higher memory capacity. High throughput is especially 
important in data-intensive tasks such as imaging, video, signal processing, etc. Prod- 
ucts often have demanding real time requirements that further exaggerate processing 
needs. In these cases, embedded processors are required to perform complex proc- 
essing steps in a limited and precisely known amount of time. Compute-intensive 
smart products use special-purpose processing engines to deliver very high perform- 
ance that cannot be achieved with a general-purpose processor and software. Such 
products are often enabled by non-programmable accelerators (NPAs) that accelerate
performance critical functions. 

Often, cost is more important than performance, with budgets that allow only a few 
cents for critical chips. Power is obviously important for battery-operated devices, but 
can be equally important in office environments where densely stacked equipment 
requires expensive cooling. In such settings, long term operating costs can easily 
exceed the purchase price of equipment. Lack of memory storage is a serious problem 
and compression techniques are used to conserve storage especially for high-volume 
video, image, and audio data. The storage of large computer programs is often too 
costly and processors are valued for their small code size. In some systems, programs 
are compressed to reduce their size and later decompressed for execution. 

While the speed of general-purpose processors may exceed 1GHz, embedded 
processors often execute at modest clock speeds relying instead on parallelism to 
achieve needed performance. The use of parallel execution, rather than high clock 
speed, allows for designs with cheaper, lower power circuits. These low cost and 
power-efficient designs are also less dependent on precisely tuned low level circuitry. 
Additional power management features, such as gated or variable speed clocks, may 
also be used. 
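
The trade between clock speed and parallelism can be made concrete with the standard first-order approximation for CMOS switching power, P = C V^2 f. The Python sketch below is only illustrative: the capacitance, voltages, and frequencies are hypothetical, and it assumes both perfect parallel speedup and that the slower clock permits a lower supply voltage.

```python
def dynamic_power(capacitance_f: float, voltage_v: float, freq_hz: float) -> float:
    """First-order CMOS switching power: P = C * V^2 * f."""
    return capacitance_f * voltage_v ** 2 * freq_hz


# Hypothetical numbers chosen only to show the trend.
C = 1e-9  # switched capacitance per unit, in farads

# One unit at full clock speed and full supply voltage.
single = dynamic_power(C, voltage_v=1.8, freq_hz=800e6)

# Four units at a quarter of the clock. Assumption: the slower clock
# lets the supply drop from 1.8V to 1.0V.
parallel = 4 * dynamic_power(C, voltage_v=1.0, freq_hz=200e6)

# Same nominal throughput (4 x 200MHz = 800MHz) at lower total power:
# roughly 2.6W for the single unit versus roughly 0.8W for the four.
print(single, parallel)
```

Without the voltage reduction the two configurations would consume the same switching power, so the saving in this sketch comes entirely from the quadratic dependence on voltage.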



1.2 Market Requirements 

Smart products require newly developed, and highly specialized, embedded systems 
in order to provide the novel and product-defining features necessary in a competitive 
marketplace. Often, the introduction of a new smart product depends upon the suc- 
cessful design and fabrication of a new chip that gives the product its distinguishing, 
high value features. Time-to-market determines the success or failure of products and
businesses. While one smart product vendor (SPV) awaits a tardy chip design before
he can ship product, a competing SPV’s product takes advantage of the higher profit 
margins and visible press available during early product introduction. The SPVs who 
deliver product early capture the lion’s share of the profit and visibility. Those whose 
products are late miss the market window and must settle for what is left. The ability
to design rapidly provides a distinct market advantage. 

There is a proliferation in the number of smart products being introduced, and an 
escalating number of embedded system designs are needed to support them. The 
problem is exacerbated by the shorter life cycles experienced by smart products in the 
consumer space, since each generation of a rapidly improving smart product requires 
new embedded systems in order to improve functionality. 

The need for complex new chip designs cannot currently be satisfied. Product in- 
novation is limited by our ability to perform expensive and time-consuming design 
tasks. A design bottleneck results from the use of a limited pool of highly talented 
engineers who must design a greater variety of complex chips at ever increasing rates. 



1.3 Application Characteristics 

Embedded applications are characterized by small well-defined workloads. Embed- 
ded systems often are used within single function products. Such products have prod- 
uct-specific and portable forms with simple and intuitive user interfaces. Single func- 
tion products are based on far simpler software than general-purpose computers. They 
do not have the system crashes and reboots that we commonly associate with general- 
purpose systems. Their administration is simple and computer-illiterate users can 
enjoy their use. Single-function products with complex, high-performance, embedded 
systems are increasingly popular as the cost of their electronic content decreases. 
Many users do not, and cannot, write programs for their products. To them, programs 
are invisible logic which is used to provide product-defining functions. 

Often, key kernels within applications represent the vast majority of required com- 
pute cycles. Kernels consist of small amounts of code which must run at very high 
performance levels to enable a smart product’s functionality. In video, image and 
digital signal processing, the applications’ execution times are often dominated by 
simple loop nests with a high degree of parallelism. These applications are typically 
characterized by a greater degree of determinacy than general-purpose applications.
Not only is the nature of the application fixed, but key parameters such as loop trip 
counts or array sizes are pre-determined by physical product parameters such as the 
size of an imaging sensor. For such demanding application-specific products, high 
performance and high efficiency are often obtained using deeply pipelined function
units which have been crafted into custom architectures designed to solve kernel 
codes. 
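
A kernel of this kind might look like the following Python sketch of a 3x3 neighbourhood sum. The sensor dimensions and the function name are hypothetical; the point is only that every loop bound is a constant fixed by the product, and that each output pixel can be computed independently.

```python
# Hypothetical sensor dimensions; in a real product these are fixed by the
# imaging hardware and known when the kernel is designed.
SENSOR_ROWS = 6
SENSOR_COLS = 8


def box_filter_3x3(image):
    """Sum each interior pixel's 3x3 neighbourhood.

    Every loop bound is a constant, so the trip counts are known statically,
    and each output pixel is independent of the others.
    """
    out = [[0] * SENSOR_COLS for _ in range(SENSOR_ROWS)]
    for r in range(1, SENSOR_ROWS - 1):
        for c in range(1, SENSOR_COLS - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += image[r + dr][c + dc]
            out[r][c] = acc
    return out


image = [[1] * SENSOR_COLS for _ in range(SENSOR_ROWS)]
result = box_filter_3x3(image)

# Total innermost iterations, computable before the program ever runs.
inner_iterations = (SENSOR_ROWS - 2) * (SENSOR_COLS - 2) * 9
print(result[1][1], inner_iterations)  # prints "9 216"
```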

Products may also have real-time constraints where time-critical computing steps 
must be completed before well understood deadlines to prevent system failure. Real- 
time systems require highly predictable execution performance. This often requires 
the use of a real-time operating system which provides task scheduling mechanisms
necessary to guarantee predictable performance. Complex dynamic techniques, such 
as the use of virtual memory, caches, or dynamic instruction scheduling may not be 
allowed. When these techniques make the accurate prediction of performance exces- 
sively difficult, real-time constraints cannot be verified. 



1.4 Overview of this Paper 

The movement of the center of gravity of computing, from general-purpose personal 
computers to special-purpose embedded computers, will cause a major upheaval in 
the computer industry. Successful computer architecture has always resulted from a 
judicious melding of the opportunities afforded by the latest technologies with the 
requirements of the market, product and application. These are significantly different 
for embedded computing leading to substantially different computer architectures, at 
both the system and the processor levels, as well as a rich diversity of off-the-shelf 
(OTS) and custom designs. Furthermore, we believe that embedded computing will 
introduce a new theme into computer architecture: the automation of computer archi- 
tecture. 

In the rest of this paper, we elaborate on these points. In Section 2, we look at em- 
bedded computer architectures, and explain the increased emphasis on special- 
purpose architectures over general-purpose ones. Section 3 considers the issue of 
customization and the circumstances that force a SPV to resort to it instead of using 
an OTS solution. In Section 4 we argue for the use of automation in architecting and 
designing embedded systems, and we articulate our philosophy for so doing. As an 
example of this philosophy, Section 5 provides an overview of PICO, the architecture 
synthesis system that the authors and their colleagues have been developing over the 
past five years. 



2 Embedded Computer System Architecture 

To achieve the requisite performance, high-performance embedded systems often 
take a hierarchical form. The system consists of a network of processors; each proces- 
sor is devoted to its specialized computing task. Processors communicate as required 
by the application and network connections among processors support only the re- 
quired communications. Each processor also provides parallelism and is constructed 
using networks of arithmetic units, memory units, registers, and control logic. Hierar- 
chical systems jointly offer both the process-level parallelism achieved with multiple 
processors as well as the instruction-level parallelism achieved by using multiple
function units within each processor.

For efficient implementation, and to provide product features such as highest per- 
formance, lowest cost, and lowest power, embedded computer system designs are 
often irregular at both levels of the design hierarchy. The entire system, as well as 
each of the processors, represents a highly special-purpose architecture that shows a
strong irregularity that closely mirrors application requirements and ideally supports
kernel needs.

Embedded Computing: New Directions in Architecture and Automation 229

Whereas specialization can be used to provide the very highest performance at the 
very lowest cost, embedded systems employ architectures which span a spectrum. At 
one end, a special-purpose architecture provides very high performance with very 
good cost-performance, but little flexibility. At the other end of the spectrum, a gen- 
eral-purpose architecture provides much lower performance and poorer cost- 
performance, but with all of the flexibility associated with software programming. A 
third choice, the OTS customizable architecture, provides an increasingly important 
compromise between these two extremes. 



2.1 Special-Purpose Architectures 

Smart products often incorporate special-purpose embedded processing systems in 
order to provide high performance with low cost and power. For example, a printer 
needs to perform multiple processing steps on its input data, e.g., error correction,
decompression, image enhancement, coordinate transformation, color correction, and 
final rendering. Enormous computation is needed to perform these steps. To achieve 
the needed performance, printers often use a pipeline consisting of multiple proces- 
sors; data is streamed through this pipeline of concurrently executing special-purpose 
processors. This style of design produces relatively inexpensive and irregular system 
architectures that are specialized to an application’s needs. 
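
The streaming structure of such a pipeline can be sketched in Python with chained generators. The stage functions below are arbitrary placeholders standing in for the printer steps, and all names are hypothetical. Generators interleave the stages on one thread, whereas the hardware stages described above run concurrently, so this models only the data flow.

```python
from typing import Callable, Iterable, Iterator, List


def stage(fn: Callable[[int], int], upstream: Iterable[int]) -> Iterator[int]:
    """One pipeline stage: consume items as they arrive, emit transformed items."""
    for item in upstream:
        yield fn(item)


def build_pipeline(source: Iterable[int], fns: List[Callable[[int], int]]) -> Iterator[int]:
    """Chain stages so that data streams through them one item at a time."""
    stream: Iterable[int] = source
    for fn in fns:
        stream = stage(fn, stream)
    return iter(stream)


# Placeholder stage functions; the arithmetic is arbitrary and only makes
# the flow of data through the stages visible.
def decompress(x: int) -> int:
    return x * 2


def enhance(x: int) -> int:
    return x + 1


def color_correct(x: int) -> int:
    return x % 256


pipeline = build_pipeline(range(4), [decompress, enhance, color_correct])
output = list(pipeline)
print(output)  # prints [1, 3, 5, 7]
```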

In performance- and cost-critical situations, these embedded computer systems are 
specialized to simplify circuitry. Enormous savings can be achieved by specializing 
high-bandwidth connections among processors. Connections of appropriate band- 
width are provided between processors, exactly as needed, to accommodate very 
specific communication needs. While high bandwidth data paths are provided among 
processors that must exchange a large volume of data, no data paths are provided 
among processors that never communicate. 
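
To give a feel for the savings, the Python sketch below compares the number of point-to-point links in a fully connected system with the number needed by a hypothetical six-processor streaming design in which each processor talks only to its successor. The communication graph is an assumption made for illustration.

```python
def full_interconnect_links(n: int) -> int:
    """Point-to-point links needed to connect every pair of n processors."""
    return n * (n - 1) // 2


# Hypothetical communication graph for a six-stage streaming design:
# each processor sends only to its successor.
NUM_PROCESSORS = 6
required_links = {(i, i + 1) for i in range(NUM_PROCESSORS - 1)}

general = full_interconnect_links(NUM_PROCESSORS)
specialized = len(required_links)
print(general, specialized)  # prints "15 5"
```

The gap widens quadratically with the number of processors, before even accounting for the bandwidth of each link.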

Likewise, the design of each processor is also specialized to the specific needs of 
the task that it performs. If we look within the design of each processor, we again see 
that specialization greatly simplifies needed circuitry. Each processor’s performance, 
memory, and arithmetic capability are all adjusted to exactly match its dedicated task 
needs. Arithmetic units and registers are connected with a network of data paths that 
is specific to task needs. Each arithmetic unit, data path, or register is optimized to 
exact width requirements dictated by the statically known arithmetic precision of the 
operations and operands that they support. This specialization process again elimi- 
nates substantial amounts of unnecessary circuitry. 
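
Width specialization of this kind reduces to a simple calculation once value ranges are known. The Python sketch below computes the minimum unsigned width for a hypothetical data path that sums nine 8-bit pixels; the example data path and the function name are illustrative assumptions.

```python
def bits_for_unsigned(max_value: int) -> int:
    """Smallest number of bits that can hold every integer in [0, max_value]."""
    if max_value < 0:
        raise ValueError("max_value must be non-negative")
    return max(1, max_value.bit_length())


# Hypothetical data path: sum nine 8-bit pixels (a 3x3 neighbourhood).
PIXEL_MAX = 255
sum_max = 9 * PIXEL_MAX  # largest value the accumulator can ever hold

pixel_bits = bits_for_unsigned(PIXEL_MAX)
acc_bits = bits_for_unsigned(sum_max)
print(pixel_bits, acc_bits)  # prints "8 12"
```

A 12-bit accumulator suffices here, where a general-purpose processor would typically spend a 32-bit adder and register on the same operation.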

The control circuitry for each special-purpose processor can also be specialized to 
exact task requirements. Control circuits often degenerate to simple state machines 
which are highly efficient in executing simple dedicated tasks. RAM structures within
each processor are distributed according to need. Special table look up RAMs may be 
connected directly to the arithmetic units which use their operands. Each RAM is 
minimal in size in both number and width of its words. Rather precise information 
about the application is used to squeeze out unnecessary arithmetic, communication,
and storage circuitry. Chained sequences of arithmetic, logical, and data-transfer 
operations are statically optimized to squeeze out additional circuitry. These opti- 
mized circuits perform multiple operations using far less logic than would be required 
if cascaded units were designed to execute each operation separately. 



2.2 General-Purpose Architectures 

The widespread use of special-purpose architectures sharply increases the number of 
distinct architectures that must be designed. If satisfactory programmable general- 
purpose system architectures existed, they would eliminate the need to specialize 
architectures to specific applications. Such general-purpose and domain-specific 
systems would offer the hope that new smart products could be designed using OTS 
parts that are reusable across many new smart products. General-purpose systems are 
highly flexible and are, in fact, reusable across almost all low-performance applica- 
tions. General-purpose systems are designed in a number of ways. A general-purpose 
RISC processor can be re-programmed for a large variety of tasks. Domain-specific 
systems are customized to specific application areas (e.g. digital signal processing) 
but not to a specific smart product or application. Domain-specific systems often
incorporate a control processor to run an operating system and a digital signal
processor (DSP) to accelerate signal processing kernel code. Too often, however,
general-purpose and domain-specific systems are not able to meet demanding embedded
computing needs. RISC, superscalar, VLIW, and DSP architectures do not efficiently
scale beyond their respective architectural limits. They have limits in clock
speed and ability to exploit parallelism that make them costly and impractical at the 
highest levels of performance. 

Symmetric multiprocessors are commonly used to extend system performance 
through the addition of more processors. Because general-purpose multiprocessors 
are designed without application knowledge, they provide uniform communication 
among processors. Identical processors are connected in a symmetric manner. Gen- 
eral-purpose interconnection networks and multiprocessor shared memories provide 
the required general and parallel access. However, such highly connected and sym- 
metric hardware scales poorly as the number of processors is increased. The commu- 
nication hardware is either over-designed, permitting high-volume data transfers that 
never occur, or under-designed and unable to handle those that do. For large numbers 
of processors, the interconnect requires large amounts of chip area and the long 
transmission delays adversely affect the cycle time. 

General-purpose systems rely on general-purpose processors for their computing 
horsepower. Each general-purpose processor uses flexible, general-purpose, function 
units (FUs) that can execute any common operation that might be needed in an arbi- 
trary program. The general-purpose FU, however, is very expensive when compared 
to specialized function units (within custom processors) that execute only a single 
operation type. General-purpose processors often require symmetric access between 
function units and registers or RAMs. They use expensive and slow multi-ported 
register files and multi-ported RAMs to allow general (and parallel) communication
and storage. Multi-ported register files and multi-ported RAMs do not scale, and
processors retain their efficiency at only modest levels of parallelism. Since the
general-purpose processor is expected to be capable of executing a broad range of
applications, it ends up being both over-designed and, often, under-designed for the
specific application it is called upon to execute in the embedded system.

The control for a general-purpose processor is based on instructions that require 
wide access, complex shifting, and long distance transfer. The unpacking of the
complex instruction formats needed to reduce code size can be very complex. This
problem is especially difficult for wide-issue superscalar or VLIW machines. Their
instruction units are well designed for supporting arbitrary and large programs on
machines of modest issue width. But, for simple and highly parallel tasks, they are too
expensive when compared to state-machine based controllers found in special- 
purpose systems. 



2.3 Off-the-Shelf, Customizable Systems 

In many settings, general-purpose and domain-specific OTS processor chips cannot 
deliver adequate computing power. Product designers seek other OTS chip architec- 
tures to meet high-performance needs. Field Programmable Gate Arrays (FPGAs) 
[12] provide one alternative approach. FPGAs use programmable logic cells inter- 
connected with a network of wires and programmable switches. Rather than relying 
on normal sequential programming, FPGAs are programmed using hardware design 
techniques. FPGAs have traditionally been used to implement simple control logic 
and “glue logic” — the left over logic needed to glue key components together. An 
FPGA allows such logic to be collected within a single chip to reduce system cost. 
The density of FPGAs has grown to the point where complex data paths are now 
possible on a single FPGA. The architecture of FPGAs continues to improve as fea- 
tures are added to support wider data paths, wider arithmetic, and substantial amounts 
of on-chip local memory. With these improvements, FPGAs are increasingly used to 
implement special-purpose, high-performance processors. 
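
The text does not describe the internals of a programmable logic cell. A common realization is a small look-up table (LUT) whose contents are loaded at configuration time, and the Python sketch below assumes that style of cell. The function names and the choice of a 3-input cell are illustrative.

```python
from itertools import product
from typing import Callable, List, Sequence


def program_lut(fn: Callable[..., int], num_inputs: int) -> List[int]:
    """Build the truth table (the cell's configuration bits) for a k-input LUT."""
    return [fn(*bits) & 1 for bits in product((0, 1), repeat=num_inputs)]


def evaluate_lut(table: Sequence[int], inputs: Sequence[int]) -> int:
    """Look up the output: the inputs form the table index, first input as MSB."""
    index = 0
    for bit in inputs:
        index = (index << 1) | (bit & 1)
    return table[index]


# Configure one hypothetical 3-input cell as the sum bit of a full adder.
sum_bit = program_lut(lambda a, b, cin: a ^ b ^ cin, 3)
# Configure a second cell of the same kind as the carry-out bit.
carry_bit = program_lut(lambda a, b, cin: (a & b) | (cin & (a ^ b)), 3)

print(sum_bit)                            # prints [0, 1, 1, 0, 1, 0, 0, 1]
print(evaluate_lut(sum_bit, (1, 1, 0)))   # prints 0
print(evaluate_lut(carry_bit, (1, 1, 0))) # prints 1
```

The same cell hardware implements either function depending only on the table loaded into it, which is what makes the logic programmable after fabrication.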

The data path and control of an embedded architecture can be specified as a circuit 
or as a network of hardware functions. The circuit can execute operations from a
repertoire of memory and logic operations that are supported directly, or through
libraries, in OTS FPGA hardware. The embedded system circuit can be carefully
specialized to application needs. After completing logic design, the circuit is then mapped
onto an FPGA. The simplified logic diagram is mapped onto logic cells and placed 
and routed within OTS FPGA hardware. 

While not as easy as software programming, embedded system design using 
FPGAs eliminates expensive, risky, and slow chip design efforts and substantially 
decreases product risk and time-to-market. Because they are programmable, the cost 
of fixing bugs in FPGA-based systems is small. A fix can be quickly tried and 
shipped. Firmware downloads can be used to fix FPGA related problems. FPGAs 
offer an increasingly important programming paradigm for delivering
high-performance processing and rapid time to market.

Due to inherent hardware costs for supporting programmable logic, FPGAs cannot 
hope to be as efficient as customized hardware. Though FPGAs may be an order of 
magnitude less efficient than customized hardware, FPGA-based designs of high- 
performance processors can be far more efficient than designs that rely solely on 
general-purpose processors for computing horsepower. 

FPGA vendors now provide OTS chips that contain a general-purpose processor, 
FPGA, and RAM hardware in a single SOC. These FPGA-based architectures permit 
the design of complex special-purpose processing systems. The general-purpose 
processor provides flexible low-performance computing while the FPGA, along with 
its associated configurable RAM blocks, allows high-performance special-purpose 
processors to be implemented as programmable logic. FPGA libraries now also pro- 
vide processor cores that are programmed as FPGA logic. These processors can be 
modified or enhanced for specific application needs. Other functionality, such as 
support for peripheral interfaces and programmable I/Os, further increases FPGA
utility. 



3 System Customization 

By customization we mean the process of taking an OTS system and modifying it to 
meet one’s requirements. In its extreme form, it entails designing the system from 
scratch. At one level, customization is quite commonplace. For instance, when one
buys a personal computer, one typically configures the amount of memory, disk, and 
the set of devices connected to the peripheral bus. But one never modifies the proces- 
sor or the system architecture (e.g., its bus structure). That task is viewed as the do- 
main of the semiconductor or computer manufacturer from whom we expect to buy 
an OTS system. Our discussion of customization is focused on this part of the overall 
embedded system, what we will refer to as the central computing complex (CCC), 
which is the set of processors, memories and interconnect involved in executing the 
embedded application. 

Customization of the CCC increases the design cost of smart products and often 
delays product introduction. Whereas it can greatly simplify the system, allowing 
high performance at low cost, it requires a complicated design process. A customized 
system involves complex tasks, including architecture, logic, circuit, and physical 
design. Designs must be verified and masks fabricated before chip production can 
begin. The SPV is well advised to use an OTS CCC when possible; from the view- 
points of time-to-market, engineering effort and project risk, this is clearly the prefer- 
able approach. 

And yet, SPVs routinely design custom CCCs, processors and accelerators. What 
are the circumstances that force them to do so? Why is it that what the SPV needs is 
not available OTS, and that using something OTS would fall far short of his needs? 

3.1 Why Customize?

Our view is that three conditions, in conjunction, create the situation that forces a 
SPV to have to customize his CCC. Firstly, the smart product must have challenging 
requirements which can only be met by specializing the system or processor archi- 
tecture, as described in Section 2. Else, the SPV could just use an OTS general- 
purpose design. Secondly, for the given application, the performance, perhaps even 
the usability, of designs within the space of meaningful, special-purpose designs must 
be very disparate. For this particular application, all the OTS designs must fall short 
to such an extent that it is worth the SPV’s while to pay the costs of customization: 
longer time-to-market, greater engineering effort and increased project risk. Lastly, 
the space of worthwhile special-purpose designs must be so large, that it is not possi- 
ble, or not economical, for someone to make them all available, on an OTS basis. 

This last criterion raises a further question. If it was worth the SPV’s while to cre- 
ate a custom design, why was it not so for a manufacturer of OTS designs to have 
designed and offered the same thing? There are at least three reasons that serve as 
explanations. The most frequent reason is that the application, or a portion of it, rep- 
resents the product-defining functionality that provides competitive differentiation to 
the smart product, i.e., it incorporates algorithms that are proprietary. If the perform- 
ance requirements on these algorithms are sufficiently demanding, the CCC architec- 
ture must be specialized to reflect these proprietary algorithms; it must be customized. 
Secondly, the unit volume represented by a given smart product may be too low to 
make it worthwhile for a supplier to provide an OTS system. A final possibility is
that the diversity of special-purpose solutions demanded by SPVs is just too large for 
every one of them to be provided as OTS solutions, even though neither of the other 
two reasons is applicable. The SPV must fend for himself. 



3.2 Customization Strategies 

Customization incurs two types of design costs: architectural design cost and physical 
design cost. The former includes the design costs associated with architecting the 
custom system and any custom processors that it may contain, performing hardware- 
software partitioning, logic synthesis, design verification and, subsequently, system 
integration. Physical design cost includes the design costs associated with the floor- 
planning, placement and routing, as well as the cost of creating a mask set. 

While customization can be used to provide the very highest performance at the 
very lowest cost, SPVs employ a variety of strategies which span a spectrum. At one 
end, a custom architecture provides very high performance, but at very high archi- 
tectural and physical design costs. At the other end of the spectrum, an OTS archi- 
tecture provides much lower performance with the low design costs associated with 
software programming. A third choice, the OTS customizable system, provides an 
increasingly important compromise: the ability to use OTS parts, but requiring FPGA 
programming that is more complex and similar to physical design. Although the ar- 
chitectural cost and a part of the physical design cost must be borne, the mask set cost 
is avoided. 

Minimizing architectural design cost. The key to minimizing architectural design
cost is to reuse pre-existing designs to the extent possible. One strategy is to fix the 
system-level architecture, but to customize at the subsystem level. For instance, the 
system architecture may consist of some specific interconnect topology, containing
one or more specific OTS microprocessors plus one or more unspecified accelerators. 
The accelerators are defined by the application and must, therefore, be custom 
designs. However, only the architectural design cost of the accelerators is incurred. 
This strategy can be applied one level down, to the processors. Most of a processor’s 
architecture can be kept fixed, but certain of the FUs, for instance, may be customized
[5].

Minimizing physical design cost. An embedded system may either fit on a single 
chip, or be spread over multiple chips. For every part of a chip that has been 
customized, one must necessarily bear the cost of placement and routing. One can 
avoid this cost for the rest of the chip by using what is known as hard IP, i.e., 
subsystems that have been taken through physical design. Although the 
floorplanning, placement and routing for the chip as a whole must still be performed, 
the hard IP blocks are treated as atomic components, greatly reducing the complexity 
of this step. However, for every chip that has been customized, even to a small extent, 
the entire cost of creating the mask sets for the chip must be borne. 

At lower levels of VLSI integration, a system used to consist of multiple chips, of 
which only a few might have needed to be custom. Furthermore, the cost of creating a 
mask set was relatively low. The advent of SOC greatly reduces the cost of a complex 
embedded system. However, it comes with a disadvantage: any customization of the 
system, however tiny, requires a new mask set. Worse yet, the cost of creating a mask 
set is now in the hundreds of thousands of dollars, and rising. 

Avoiding mask set costs. This is a powerful incentive to avoid VLSI design 
completely, motivating the notion of OTS, customizable SOCs or “reconfigurable 
hardware”. The basic idea, as before, is to fix certain aspects of the system’s and 
processors’ architectures, and to allow the rest to be custom. The difference is that 
instead of implementing the custom accelerators and FUs using standard cells, they 
are implemented by mapping them, after performing logic synthesis, on to FPGAs. 
Accordingly, the OTS SOC contains the fixed portions of the system architecture, 
implemented as standard cells, but provides FPGAs instead of the custom processors, 
accelerators and FUs. By programming the FPGAs appropriately, the OTS SOC can 
be customized to implement a number of different custom system architectures. The 
entire architectural design cost for the custom subsystems is incurred, as are the costs 
of programming the FPGAs (similar in many ways to placement and routing), but the 
VLSI design costs are eliminated. 

This is an extremely attractive approach when the desired custom system fits the 
system-level architecture of the OTS, customizable SOC. If not, the SPV must design 
a custom SOC. For high volume products, where application needs are well under- 
stood, and where high design cost and design time can be tolerated, customized SOCs 
provide higher performance at a lower cost than do FPGAs. However, the programmability 
of an OTS FPGA allows it to serve a far greater variety of immediate product 
needs and allows complex products to get to market more quickly and to evolve with 
changing application requirements. Further benefits of this approach are quick 
prototyping and field programmability in the event that the nature of the customiza- 
tion needs to be changed. 

Note that our discussions of the OTS customizable system here and in Section 2.3 are 
two views of the same thing. There, our view of it was as an alternative style of 
architecture. Here, our view of it is the more traditional one, as a way of implementing 
a hardware design. 



4 Automation of Computer Architecture 

Embedded computing, with its distinct set of requirements and constraints, is gener- 
ating the need for large numbers of custom embedded systems. Whether they are 
implemented as custom SOCs or by using OTS, customizable SOCs, the architectural 
costs must be incurred. We believe that the need for mass customization, and its 
associated architectural costs, have brought into existence a new theme in computer 
architecture: the automation of computer architecture. We call this architecture 
synthesis in order to distinguish it both from other forms of high-level synthesis, such 
as behavioral synthesis [4, 7], and from low-level synthesis, such as logic 
synthesis [8]. 



4.1 When Is Automation Important? 

One could argue that for the foreseeable future, an automatically architected computer 
system can be expected to be less well-designed than a manually architected and 
tuned design. This statement is not as obviously true as it might sound; the ability of 
an automated system to evaluate thousands of disparate designs could quite conceiva- 
bly yield a superior design. But if we accept this statement as true, the question that 
arises is under what circumstances it is desirable to use automation. We believe that 
there are at least three sets of circumstances that argue for automation. 

The first one is when the desired volume of custom designs, stimulated by an ex- 
plosion in the number of smart products, exceeds the available design manpower. The 
demand might be due to a large number of either application-specific or domain- 
specific designs. It might also be due to shortened product life cycles, either because 
the relevant standards are evolving, or because of competitive pressures in a con- 
sumer business. Automation addresses the problem by sharply increasing the aggre- 
gate design bandwidth. 

Automation is also useful when time-to-market or time-to-prototype is crucial. 
Often, the product definition could not be anticipated far enough ahead of time to use 
manual design methodologies, perhaps because a relevant standard had not yet 
converged, or perhaps the functionality of a new product was not yet clearly understood. 
In such cases, the speed of automated techniques is of great value. 

A third motivation for automation occurs when the expected volume of the custom 
design is too small to permit the product to be economical using a manual design 
process. Automation reduces the design cost (which must be amortized over the small 
volume), and can make such a product viable. 
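
The amortization argument can be made concrete with a small Python sketch. The function name and every figure below are hypothetical, chosen only to illustrate the arithmetic; they do not describe any real product.

```python
def unit_cost(design_cost: float, manufacturing_cost: float, volume: int) -> float:
    """Per-unit cost when a one-time design cost is amortized over `volume` units."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return design_cost / volume + manufacturing_cost


# Hypothetical figures for a low-volume product of 10,000 units:
# a manual design effort versus an automated one whose result is
# assumed to be slightly more expensive to manufacture.
manual = unit_cost(design_cost=5_000_000, manufacturing_cost=10.0, volume=10_000)
automated = unit_cost(design_cost=200_000, manufacturing_cost=12.0, volume=10_000)
print(manual, automated)
```

Under these assumed numbers the amortized design cost dominates the unit cost at small volume, so a somewhat less efficient automated design can still be the cheaper product.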



4.2 A Philosophy of Automation 

A typical reaction to the notion of automating computer system design is that it is a 
completely unrealistic endeavor. Typically, the assumption underlying this reaction is 
that the automatic design system would emulate the human design methodology. That 
would, indeed, be a very hard problem in artificial intelligence, since human design- 
ers tend to invent new solutions to problems they encounter during design. We do not 
believe that one should try to build an architecture synthesis system that does this. 
Instead, our approach picks the most suitable design out of a large, possibly un- 
bounded, denumerable design space. This space has to be large and diverse enough to 
ensure that there is a sufficient repertoire of good designs so that a best design, se- 
lected from the space, will closely match application needs. 

Framework and Parameters. Since it is impractical to explicitly enumerate every 
feasible design, the space of designs is defined by a set of rules and constraints that 
must be honored by each design in the space. We call this a framework. Within a 
framework, some aspects of the design, such as the presence of certain modules and 
the manner in which they are connected are predetermined. Other aspects of the 
design are left unspecified. Of these, some can be derived once the rest have been 
specified. We refer to the latter as parameters. The specification of a design consists 
of binding the values of the parameters. From these, the derived aspects of the design 
are computed. Together, and in the context of the framework, they constitute a 
completely specified design. We define construction to be the process of deriving the 
detailed, completely specified design once the parameters are given. 

Our philosophy for deciding what is a parameter, and what is not, is determined 
operationally. When we believe that we have an algorithmic way of determining 
certain design details in an optimal or near-optimal manner, we view their definition 
as part of implementation. When we have no clear way of determining important 
attributes of a design, and we use heuristic search to determine well-chosen values, 
we view them as parameters. After all design parameters are bound, a design is com- 
pletely specified and the design can then be constructed. In the specification of a 
given design, it is often the case that not every combination of parameters is valid. 
For a system design to be valid, certain parameters of one subsystem must match the 
corresponding parameters of another subsystem. These are expressed as validity 
constraints involving the parameters that must match [1]. 

Components. Designs are constructed by assembling lower-level components, picked 
from a component library. Sometimes, these components are parameterized with 
respect to certain of their attributes; once the parameters are specified, a component 
constructor can be used to instantiate the corresponding component. The components, 
or their constructors, are designed and optimized manually. The components must fit 
into the framework and must collectively provide all of the building blocks needed to 
construct any design. 

In addition to its detailed design, each component must have associated informa- 
tion needed during the construction of the system design. Information needed to 
properly interface components into a broader design context includes a description of 
a component’s functional capability, a description of a component’s input and output 
wiring needs, and a description of a component’s externally visible timing and re- 
source requirements. Components are often described as a network of lower-level 
components forming a design hierarchy. 

A Paradigm for Automation. Typically, one has multiple evaluation metrics in mind 
(such as cost, performance and power) when picking a good design. Thus, finding an 
optimal design involves a multi-objective optimization task. A design is said to be a 
Pareto-optimal design, or a Pareto design for short, if there is no other design that is 
at least as good as it with respect to every evaluation metric, and better than it with 
respect to at least one evaluation metric. The set of all Pareto designs is the Pareto 
set, or the Pareto for short. 
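
The definition can be stated directly as code. The following is a minimal sketch that assumes every metric has been expressed so that lower is better (for example, cost and execution time); the function names and sample points are illustrative.

```python
from typing import List, Sequence, Tuple

Metrics = Tuple[float, ...]  # e.g. (cost, execution_time); lower is better


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if `a` is at least as good as `b` on every metric and better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(designs: Sequence[Metrics]) -> List[Metrics]:
    """Keep exactly those designs that no other design dominates."""
    return [d for d in designs if not any(dominates(other, d) for other in designs)]


# Hypothetical (cost, execution_time) points.
designs = [(10, 9.0), (12, 7.0), (12, 8.0), (15, 7.0), (20, 4.0)]
print(pareto_set(designs))
```

Here (12, 8.0) and (15, 7.0) are both dominated by (12, 7.0), so only three of the five points survive.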

We automatically find a Pareto set using three interacting modules. The space- 
walker explores the space of possible designs, looking for the Pareto-optimal ones. 
The space of possible designs is specified to the spacewalker by the user, who pro- 
vides a range of values for each parameter. The design space is the Cartesian product 
of the sets of values for the various parameters. It defines the space of designs that the 
spacewalker must explore. At each step in the search, the spacewalker specifies a 
design by binding parameters. A constructor can take a design, as specified by the 
spacewalker, and construct a hardware realization of the chosen design. The effects of 
a component binding are evaluated using an evaluator that determines the suitability 
of the spacewalker’s choice. Evaluation is most accurately performed by first 
executing the constructor for a design, with appropriately bound parameters, to produce a 
detailed design. Then the evaluators use the detailed design to compute the evaluation 
metrics. When the cost of constructing candidate detailed designs is excessive, ap- 
proximate evaluation metrics can sometimes be quickly estimated directly from de- 
sign parameters. The evaluation process uses multiple tools including compilers, 
simulators, and gate count estimators. 
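
The interaction of the three modules can be sketched as follows for the exhaustive case. This is an illustration under stated assumptions, not PICO's implementation: `construct` and `evaluate` are stand-ins, the parameter names `fus` and `regs` are hypothetical, and the cost and time formulas are a toy model in which lower is better for both metrics.

```python
from itertools import product


def dominates(a, b):
    """a is no worse than b on every metric and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def construct(params):
    """Stand-in constructor: a real one would produce a detailed hardware design."""
    return dict(params)


def evaluate(design):
    """Stand-in evaluator returning (cost, execution time) from a toy bottleneck model."""
    cost = 100 * design["fus"] + design["regs"]
    time = max(1000 / design["fus"], 8000 / design["regs"])
    return (cost, time)


def spacewalk(ranges):
    """Exhaustively walk the Cartesian product of the user-supplied parameter ranges."""
    names = list(ranges)
    pareto = []  # (params, metrics) pairs not dominated by anything seen so far
    for values in product(*(ranges[name] for name in names)):
        params = dict(zip(names, values))        # bind the parameters
        metrics = evaluate(construct(params))    # construct, then evaluate
        if any(dominates(m, metrics) for _, m in pareto):
            continue                             # dominated by an earlier design
        pareto = [(p, m) for p, m in pareto if not dominates(metrics, m)]
        pareto.append((params, metrics))
    return pareto


result = spacewalk({"fus": [1, 2, 4], "regs": [16, 32, 64]})
for params, metrics in result:
    print(params, metrics)
```

A heuristic spacewalker would replace the `product` loop with a guided search, but the bind, construct, evaluate and Pareto-update steps would keep the same shape.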

At each step in the search, the spacewalker invokes the constructor and evaluators 
to determine whether, in the context of the designs evaluated thus far, the latest de- 
sign is Pareto-optimal. If the design space is small, the spacewalker may use an ex- 
haustive search. Otherwise, at each step, it uses the evaluation metrics, possibly other 
statistics relating to the design, and appropriate heuristics to guide it in taking the next 
step in its search. The goal in this case is to find all, or most, of the Pareto-optimal 
designs while having examined a very small fraction of the design space. 
Spacewalking is hierarchical when an optimized system is designed using a spacewalking 
search and components of the system are treated as sub-systems that are in turn opti- 
mized using lower-level spacewalking searches. 

A framework restricts the generated design to a subset of all possible designs that a 
human might have created. However, it is precisely from this that the power of a 
framework arises. It is difficult to conceive of how one could create constructors and 
evaluators capable of constructing and evaluating any possible design. Yet, without 
constructors and evaluators, automation would be impossible. The limits placed by a 
framework, on the types of designs that have to be evaluated and constructed, are 
crucial to making automation possible. The challenge, when designing a framework, 
is to choose one that is large and diverse enough to contain good designs, while at the 
same time retaining the ability to evaluate and construct every design in that frame- 
work. 



5 The PICO Architecture Synthesis System 

In order to illustrate our automation philosophy, we briefly describe PICO (Program 
In, Chip Out), our research prototype of an architecture synthesis system for auto- 
matically designing custom, embedded computer systems. It employs our paradigm 
for automation, in a hierarchical fashion, four times over. PICO takes an application 
written in C, automatically architects a set of Pareto-optimal system designs, and 
emits the structural VHDL for them. Currently, optimality is defined by two evalua- 
tion metrics: cost (gate count or chip area) and performance (execution time). PICO 
explores trade-offs between the ways in which silicon area can be utilized in such a 
system, presenting to the user a set of Pareto-optimal system designs. In the process, 
PICO does hardware-software co-design, partitioning the given application between 
hardware (one or more custom NPAs) and software (on a custom EPIC/VLIW 
processor¹ [10]). PICO also retargets a compiler to each custom VLIW processor; we call 
this processor-compiler co-design. 



5.1 System Synthesis 

At the system level, PICO’s task is to identify the Pareto-optimal set of custom, appli- 
cation-specific, embedded system designs for a given application. Each system that 
PICO designs consists of a custom VLIW processor and a custom, two-level cache 
hierarchy. The cache hierarchy consists of a first-level data cache (Dcache), a first- 
level instruction cache (Icache), and a unified second-level cache (Ucache). In addi- 
tion, the system may contain one or more custom NPAs that work directly out of the 
second-level cache. PICO exploits the hierarchical structure of this design space. 
Within PICO’s framework, a system design consists of a VLIW processor, one or 
more NPAs, and a cache hierarchy. Accordingly, PICO decomposes the system 
design space into smaller design spaces, one for each of these major subsystems. The 
components that PICO uses to create a system-level design are the custom, 
application-specific VLIW processors, NPAs and cache hierarchies that are yielded by the 
spacewalkers and constructors for the various subsystems, as discussed below. The 
system-level parameters are the union of the parameters for the VLIW processor, the 
NPAs, and the cache hierarchy. 

¹ EPIC (Explicitly Parallel Instruction Computing) is a generalization of VLIW. For 
convenience, in the rest of this paper we use the term VLIW to include EPIC as well. 

The system design space typically contains millions of designs, each requiring an 
hour or more of profiling, compilation, synthesis and simulation time. An exhaustive 
search of the design space is infeasible. Instead, PICO exploits the hierarchical 
structure of the design space. The basic intuition is that Pareto-optimal systems are 
composed out of Pareto-optimal component subsystems [1]. Accordingly, the system-level 
spacewalker invokes the subsystem-level spacewalkers to get the Pareto-optimal 
sets of subsystem designs. The set of all combinations of Pareto-optimal subsystems 
is far smaller than the original design space. In the simplest case, the system-level 
spacewalker would consider all these combinations, evaluate them, and discard all 
that are not Pareto-optimal system designs. But due to validity constraints, not all 
combinations are valid. Therefore, the system-level spacewalker requires that each 
subsystem-level spacewalker return not just a single Pareto set, but rather a set of 
parameterized Pareto sets. When composing subsystems, the system-level space- 
walker enforces validity constraints by only combining designs from compatible, 
parameterized Pareto sets. For such compatible Pareto sets, it considers all combina- 
tions of subsystems, evaluates them, and discards all that are not Pareto-optimal sys- 
tem designs. 
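
The composition step can be sketched as follows. This is a simplified illustration, not PICO's code: it assumes each subsystem spacewalker returns a dictionary mapping the value of a shared parameter (here, hypothetically, a number of memory ports) to a Pareto set of (cost, time) points, that two Pareto sets are compatible exactly when their keys are equal, and that system cost and time are the sums of the subsystem values.

```python
from itertools import product


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto(points):
    unique = list(dict.fromkeys(points))  # drop exact duplicates, keep order
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


def compose(sub_a, sub_b):
    """Combine two parameterized Pareto sets into a system-level Pareto set.

    Only Pareto sets filed under the same shared-parameter value are combined,
    which is how the validity constraint is enforced in this sketch.
    """
    candidates = []
    for key in sub_a.keys() & sub_b.keys():
        for (cost_a, time_a), (cost_b, time_b) in product(sub_a[key], sub_b[key]):
            candidates.append((cost_a + cost_b, time_a + time_b))
    return sorted(pareto(candidates))


# Hypothetical subsystem results, keyed by number of memory ports.
vliw = {1: [(100, 50), (150, 40)], 2: [(120, 35)]}
caches = {1: [(20, 10), (40, 6)], 2: [(30, 8)], 4: [(90, 2)]}
system = compose(vliw, caches)
print(system)
```

The four-port cache designs are never considered because no processor design is compatible with them, and two of the one-port combinations are discarded because the two-port system dominates them.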

For each loop nest in the application that is a candidate to be implemented in 
hardware, the spacewalker examines both options: software on the VLIW processor, or 
hardware as an NPA. If there are N such loop nests, the 
spacewalker explores 2^N system architectures, from those that have no NPAs at all, to 
those in which every loop nest has been implemented as an NPA. For each system 
architecture, it comes up with the Pareto set as described above. It then forms the 
union of these Pareto sets and finds the Pareto-optimal designs within this union. This 
final Pareto set contains all designs of interest for the application. Typically, at low 
performance levels, the Pareto-optimal designs contain no NPAs. Conversely, at 
sufficiently high levels of performance, all of the loop nests may be implemented as 
NPAs. In this manner, the system-level spacewalker makes different hardware-
software partitioning decisions for Pareto-optimal systems with varying cost and 
performance. 
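
The enumeration of the 2^N partitionings can be illustrated with a short sketch. All names and numbers are hypothetical, and for brevity each system architecture is reduced to a single (cost, time) point rather than a full Pareto set.

```python
from itertools import combinations

BASE_COST = 100                                    # processor plus caches (hypothetical)
SW_TIME = {"fir": 40, "dct": 30, "huffman": 10}    # time of each loop nest in software
NPA = {"fir": (50, 4), "dct": (60, 5), "huffman": (80, 9)}  # (cost, time) as an NPA


def partitions(loop_nests):
    """Yield every subset of the candidate loop nests to be implemented as NPAs."""
    nests = list(loop_nests)
    for k in range(len(nests) + 1):
        for subset in combinations(nests, k):
            yield frozenset(subset)


def evaluate(in_hardware):
    """Additive toy model: NPA costs add to the base cost, per-nest times add up."""
    cost = BASE_COST + sum(NPA[n][0] for n in in_hardware)
    time = sum(NPA[n][1] if n in in_hardware else SW_TIME[n] for n in SW_TIME)
    return (cost, time)


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


points = {p: evaluate(p) for p in partitions(SW_TIME)}
pareto = {p: m for p, m in points.items()
          if not any(dominates(q, m) for q in points.values())}
for p, m in sorted(pareto.items(), key=lambda item: item[1]):
    print(sorted(p), m)
```

With these assumed numbers the cheapest Pareto point has no NPAs and the fastest implements all three loop nests in hardware, matching the trend described above.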

The system-level constructor utilizes the constructors for the VLIW, NPA and 
cache hierarchy subsystems to construct the subsystem designs. It then glues these 
subsystem designs together by synthesizing the appropriate hardware and software 
interfaces between the VLIW processor and the NPAs. Likewise, the system-level 
evaluators make use of the subsystem-level evaluators. System designs are evaluated 
by adding the costs and the execution times, respectively, of the component subsys- 
tems. 



5.2 VLIW Synthesis 

PICO-VLIW is the PICO subsystem that designs custom, application-specific VLIW 
processors and generates a parameterized set of Pareto sets [3]. In addition, it 
retargets Elcor (PICO’s VLIW compiler) to each new processor so that it can compile the 
C application to that processor. The processors currently included within 
PICO-VLIW’s framework encompass a broad class of VLIW processors with a number of 
sophisticated architectural and micro-architectural features [6, 9]. A complete specifi- 
cation of a VLIW processor within this framework involves hundreds of detailed 
decisions. If all of these were parameters, it would result in an extremely unwieldy 
design space exploration task. Our choice of the interface between the spacewalker 
and the VLIW constructor involves a delicate balance between giving the space- 
walker adequate control over the architecture, without bogging it down by requiring it 
to specify all details. Our compromise is that the parameters that the spacewalker 
must specify are limited to the sizes and types of register files, the operation reper- 
toire, and the requisite level of ILP concurrency. Thereafter, it is the job of the VLIW 
constructor to make the remaining detailed design decisions. 

The number of parameters needed to specify a VLIW design is still relatively 
large. Consequently, even if the range for each parameter is small, the size of the 
design space can be extremely large. Furthermore, the evaluation of the performance 
of a VLIW design is time-consuming, since it involves compiling a potentially large 
application. The spacewalker, therefore, explores the design space using sophisticated 
search strategies and heuristics to prune the design space based on previously evalu- 
ated processor designs. 

For each set of parameters generated by the spacewalker, the VLIW constructor 
designs the architecture and micro-architecture of the specified VLIW processor, 
including the execution datapaths, the instruction format and the instruction unit, and 
emits structural VHDL. It also automatically extracts that part of the machine-
description database (mdes) [9] that drives Elcor during scheduling and register 
allocation. The VLIW constructor uses RTL components from PICO’s macrocell library 
(such as adders, multiplexers and register files) to synthesize the VLIW processor. In 
addition to the gate-level design, each component has associated with it information 
regarding its area, gate count, and degree of pipelining. Functional unit macrocells 
also are annotated with the set of opcodes that they can execute. 

The cost evaluator estimates the chip area and gate count for the design using pa- 
rameterized formulae for area and gate count that are attached to each component in 
the macrocell library. These formulae are calibrated against actual designs. The per- 
formance evaluator estimates the execution time of the given application on the newly 
designed processor using Elcor and PICO’s retargetable assembler. Both are auto- 
matically retargeted by supplying them with the mdes for the target processor. The 
schedule created by Elcor, along with the profiled frequency of execution of each 
basic block, suffices to estimate the execution time. The object code generated by the 
assembler serves two objectives during design space exploration. One is to evaluate 
the code size and its impact upon the cost of main memory. The other is to permit an 
estimate of its effect upon the Icache and Ucache miss rates and the resulting impact 
on execution time. 
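
The execution-time estimate from the schedule and the profile amounts to a weighted sum, as the following sketch shows. The block names and numbers are hypothetical, and the sketch deliberately ignores cache-miss stalls, which are accounted for separately.

```python
def estimated_cycles(schedule_length, frequency):
    """Sum over basic blocks of (scheduled cycles per execution) x (profiled count)."""
    return sum(schedule_length[block] * frequency[block] for block in schedule_length)


# Hypothetical per-block schedule lengths (cycles) and profiled execution counts.
schedule_length = {"entry": 4, "loop_body": 6, "exit": 3}
frequency = {"entry": 1, "loop_body": 1000, "exit": 1}

cycles = estimated_cycles(schedule_length, frequency)
print(cycles)
```

Because the estimate needs only the schedule and the profile, no simulation of the application on the candidate processor is required.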



5.3 NPA Synthesis 

PICO-N is the PICO subsystem for designing NPAs customized to a given loop nest 
and for obtaining a parameterized set of Paretos for such NPAs [11]. A design in the 
NPA framework consists of a synchronous, one- or two-dimensional array of cus- 
tomized processing elements (PEs) along with their local memories and interfaces to 
global memory, a controller, and a control and data interface to a host processor. Each 
PE is a datapath with a distributed register file structure. Each register file is a FIFO 
with random read access. Interconnections between the FIFOs and FUs exist only as 
needed by the computation, resulting in a sparse and irregular interconnect structure 
with connections to only some of the FIFOs’ elements. PICO-N also synthesizes the 
code required to make use of the NPA. 

The design parameters are the number of PEs, the initiation interval (II) between 
starting successive iterations on any single PE, and the amount of memory bandwidth 
to the second-level cache. Since the number of parameters is small, the spacewalker 
performs an exhaustive search through the design space defined by the ranges of the 
three parameters. Because the precise geometry of the array of PEs is not specified as 
a parameter, the spacewalker steps through all one- and two-dimensional array 
geometries which have the specified number of PEs. This serves as an additional 
parameter for the constructor. The results of this search are used to create a set of 
Paretos, each one parameterized by the number of memory ports that the NPA has to 
the second level cache. 
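
Stepping through the array geometries for a given number of PEs amounts to enumerating factor pairs. The sketch below is illustrative; treating a 1 x n array and an n x 1 array as distinct geometries is an assumption made here, not something the text specifies.

```python
def array_geometries(num_pes):
    """All (rows, cols) shapes with rows * cols == num_pes.

    Shapes with a single row or a single column are the one-dimensional arrays.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    return [(rows, num_pes // rows)
            for rows in range(1, num_pes + 1)
            if num_pes % rows == 0]


print(array_geometries(12))
```

A prime number of PEs admits only one-dimensional arrays, so the number of geometries the spacewalker must try varies with the PE count.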

The NPA constructor starts off with a loop nest expressed as a sequential compu- 
tation working out of global memory. It first tiles the loop nest, creating a new set of 
sequential outer loops, that will run on the VLIW processor, along with an inner loop 
nest with fewer iterations (the tile), that will be executed in parallel by the NPA. This 
transformation allows the constructor to not exceed the available global memory 
bandwidth, by performing register promotion in the inner loop nest, while minimizing 
the cost of the additional registers that this entails. Smaller tiles result in lower hard- 
ware cost but higher memory traffic. The constructor uses constrained combinatorial 
optimization to minimize the tile volume without exceeding the memory bandwidth 
allocated to the NPA. Next, the constructor transforms the inner loop nest into multi- 
ple, identical, synchronously parallel computations, one per PE. It does so by assign- 
ing a PE and a start time to each iteration in the tile while honoring data dependences 
between iterations. Furthermore, from the perspective of each processor, this iteration 
schedule is required to start an iteration every II cycles. This allows the constructor to 
express the computation on each PE as a single loop that performs all of the iterations 
assigned to that PE. 
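
The tiling transformation itself can be illustrated on a two-deep loop nest. This sketch only shows the restructuring of the iteration space into sequential outer loops and an inner tile; it assumes the reordering is legal (it ignores data dependences) and does not model the tile-size optimization, the PE assignment, or the iteration schedule. Function names and sizes are hypothetical.

```python
def original_order(n, m):
    """Iterations of the untiled loop nest, in sequential order."""
    return [(i, j) for i in range(n) for j in range(m)]


def tiled_order(n, m, tile_i, tile_j):
    """Same iterations, grouped into tiles of at most tile_i x tile_j."""
    order = []
    for i0 in range(0, n, tile_i):            # new sequential outer loops
        for j0 in range(0, m, tile_j):
            # Inner loop nest: the tile, which would be handed to the NPA.
            for i in range(i0, min(i0 + tile_i, n)):
                for j in range(j0, min(j0 + tile_j, m)):
                    order.append((i, j))
    return order


tiled = tiled_order(5, 7, 2, 3)
print(tiled[:6])
```

Tiling changes only the order in which iterations are visited, never the set of iterations, which is why it can be applied before the inner loop nest is mapped onto the PEs.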

This loop is used to synthesize a single PE. Using user-specified pragmas regard- 
ing the requisite bit widths of variables, the constructor infers the minimum width 
requirements for all variables, temporaries, and operations. This allows the width of 
every register, FU and datapath to be minimized. A minimum-cost set of FUs is 
allocated and the operations of the loop body are assigned to these functional units 
and scheduled in time. Elcor is used to perform software-pipelining of the loop at the 
specified II using a variety of heuristics to minimize hardware cost. At this point the 
hardware for one PE is materialized, using the RTL components in the macrocell 
library. Register files, in the form of FIFOs with random read access, are allocated to 
hold temporary values. Interconnections between the FIFOs and FUs are created only 
as needed, resulting in a sparse and irregular interconnect structure with connections 
to only certain of the FIFOs’ elements. 

An NPA consists of multiple instances of the PE, configured as an array. As many 
copies of the PE are created as specified by the design parameter, and interconnected 
in the specified geometry. The controller and the global memory interface are gener- 
ated. The NPA, including its registers and array-level local memories, is accessed via 
a local memory interface by the VLIW processor. It may be initialized and examined 
using this interface. Finally, structural VHDL for the NPA is emitted. 

The NPA constructor also performs some software synthesis. It takes the loop nest 
after tiling, with its additional outer loops, and removes the inner loop nest that has 
now been implemented as hardware. In its place, it generates the code that will invoke 
the NPA after making the appropriate initializations via the NPA’s local memory 
interface. This new loop nest is inserted back into the application in place of the loop 
nest that was presented to PICO-N. Instead of executing the loop nest on the VLIW 
processor, the application will now trigger the computation on the NPA. 

The cost evaluator estimates the chip area and gate count for the NPA, as described 
earlier for the VLIW processor. The performance of the NPA (which 
executes a predictable loop nest) is estimated using a formula rather than simulation. 



5.4 Cache Hierarchy Synthesis 

The third major PICO subsystem automatically generates a parameterized set of 
Pareto sets for cache hierarchies that have been customized to the given application [2]. 
A design within the cache hierarchy framework consists of a first-level Dcache, a 
first-level Icache and a second-level Ucache. Just as at the system level, PICO de- 
composes the cache hierarchy design space into smaller design spaces for the Dcache, 
Icache and Ucache, respectively. Each of the three caches — Icache, Dcache and 
Ucache — is parameterized by the number of sets, the degree of associativity, the line 
size, and the number of ports. A valid cache hierarchy design must have parameters 
that are compatible as specified by the validity constraints, e.g. the porting on the 
Ucache must at least be equal to the Dcache porting (if data fetched from the Ucache 
is permitted to bypass the Dcache). A final parameter, dilation, is an attribute of the 
code size for each VLIW processor. This parameter determines not the design of the 
cache hierarchy, but rather the performance of the Icache and Ucache. 
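
The cache parameters and the porting constraint can be written down as a small data structure. This is a sketch with hypothetical names; it encodes only the one validity constraint quoted above, and the capacity formula is the standard product of sets, associativity and line size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cache:
    sets: int
    associativity: int
    line_size: int  # bytes per line
    ports: int

    @property
    def capacity(self) -> int:
        """Total data capacity in bytes."""
        return self.sets * self.associativity * self.line_size


def ports_compatible(dcache: Cache, ucache: Cache, bypass_dcache: bool = True) -> bool:
    """Ucache porting must be at least the Dcache porting if bypassing is permitted."""
    if bypass_dcache:
        return ucache.ports >= dcache.ports
    return True


dcache = Cache(sets=64, associativity=2, line_size=32, ports=2)
ucache = Cache(sets=512, associativity=4, line_size=64, ports=1)
print(dcache.capacity, ports_compatible(dcache, ucache))
```

A spacewalker would use such a predicate to decide which Dcache and Ucache Pareto sets may be combined.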

The spacewalker takes advantage of the fact that the cache hierarchy design space 
is decomposed into three smaller design spaces corresponding to the Icache, Dcache 
and Ucache, respectively. For each of these design spaces, a parameterized set of 
Pareto sets is formed. Each design space’s set contains member Pareto sets which are 
formed separately for each setting of the values of the parameters that participate in 
validity constraints. The recomposition step uses these Paretos to form parameterized 
Pareto sets for the overall cache hierarchy, while enforcing validity constraints by 
only considering those combinations of Icache, Dcache and Ucache Pareto sets that 
are compatible. 

The costs of the Icaches, Dcaches and Ucaches are evaluated using parameterized 
formulae. Trace-driven simulation is used to evaluate the miss rates, using the 
Cheetah cache simulator [13], which exploits inclusion properties between caches to 
simulate, in a single pass through the address trace, a range of cache designs with a 
common line size. This trace-driven cache simulation is done once for a reference VLIW 
processor. To a first order of approximation, the Dcache miss rate is assumed to be 
unaffected by the details of the VLIW processor. But this is not the case for the 
Icache and Ucache miss rates. Analytic techniques that use the dilation parameter, 
combined with interpolation of the reference processor’s miss rate, are used to esti- 
mate the performance of these caches. 



6 Conclusions 

System-on-chip levels of VLSI integration are causing the center of gravity of the 
computer industry to move into embedded computing. The driving application for 
embedded computing will be a rich diversity of innovative smart products which 
depend, for their functionality, on the availability of extremely high-performance, 
low-cost embedded computer systems. 

Successful computer architecture results from achieving a careful balance between 
the opportunities afforded by the latest technologies, on the one hand, and the re- 
quirements of the market, product and application, on the other. The opportunities, 
needs and constraints of embedded computing are quite distinct from those of gen- 
eral-purpose computing. This creates a new playing field and will lead to substantially 
different computer architectures, at both the system and the processor levels. Embed- 
ded architectures will be far more special-purpose, heterogeneous and irregular in 
their structure. There never was such a thing as the “one right architecture”, even for 
general-purpose computing. This is even truer for embedded computing, which will 
trigger a renaissance in system and processor architecture. Custom and customizable 
architectures will assume a new importance. 

Furthermore, we believe that the very large number of custom architectures re- 
quired, due to the expected explosion in the number of smart products, will introduce 
a new theme into computer architecture: the automation of computer architecture. Our 
experience with PICO is that this is a perfectly practical and effective endeavor. The 
resulting designs are quite competitive with manual designs, but are obtained one or 
two orders of magnitude faster. 



Abstract. Within two or three technology generations, processor architects will 
face a number of major challenges. Wire delays will become critical, and power 
considerations will temper the availability of billions of transistors. Many impor- 
tant applications will be object-oriented, multithreaded, and will consist of many 
separately compiled and dynamically linked parts. To accommodate these shifts 
in both technology and applications, microarchitectures will process instruction 
streams in a distributed fashion - instruction level distributed processing (ILDP). 
ILDP will be implemented in a variety of ways, including both homogeneous 
and heterogeneous elements. To help find run-time parallelism, orchestrate dis- 
tributed hardware resources, implement power conservation strategies, and to pro- 
vide fault-tolerant features, an additional layer of abstraction - the virtual machine 
layer - will likely become an essential ingredient. Finally, new instruction sets may 
be necessary to better focus on instruction level communication and dependence, 
rather than computation and independence as is commonly done today. 



1 Introduction 

Processor performance has been increasing at an exponential rate for decades; most 
computer users now take it for granted. In fact, this performance increase has come 
only through concerted effort involving the interplay of microarchitecture, underlying 
hardware technology, and software (compilers, languages, applications). Because of 
changes in underlying technology and software, future microarchitectures are likely to 
be very different from the complex heavyweight superscalar processors of today. 

For nearly twenty years, microarchitecture research has emphasized instruction level 
parallelism (ILP) - improving performance by increasing the number of instructions 
per cycle. In striving for higher ILP, microarchitectures have evolved from pipelining to 
superscalar processing, with researchers pushing toward increasingly wide superscalar 
processors. Emphasis has been on wider instruction fetch, higher instruction issue rates, 
larger instruction windows, and increasing use of prediction and speculation. This trend 
has largely been based on exploiting technology improvements and has led to very 
complex, hardware-intensive processors. 

Regarding applications, the focus for microarchitecture research has been SPEC- 
type applications, consisting of single threaded programs written in the conventional 
C and FORTRAN languages, following a big static compile model. That is, the entire 
binary is compiled at once, with very heavy optimization, sometimes with the assistance 
of profile feedback. 

Starting with the conventional, big-compile view of software and ever-increasing
transistor budgets, microarchitecture researchers made significant progress through the 
mid-90s. More recently, however, the problem has seemingly been reduced to one of 
finding ways of consuming transistors in some fashion. It is not surprising that the result 
is hardware-intensive and complex. Furthermore, the complexity is not just in critical 
path lengths and transistor counts; there is also high intellectual complexity resulting 
from increasingly intricate schemes for squeezing performance out of second and third 
order effects. 

In the future, there will be substantial shifts in both software and hardware tech- 
nology. In fact, these shifts are already underway, and the conventional ILP-based mi- 
croarchitecture approach does not address them very well. In software the shift is toward 
object oriented programs that exploit thread level parallelism and dynamic linking. In 
hardware, long wire delays will dominate gate delays. These shifts will lead to gen- 
eral purpose microarchitectures composed of small, simple interconnected processing 
elements, running at a very high clock frequency. Multiple threads, some completely 
transparent to conventional software, will be managed by a thin layer of software,
co-designed with the hardware and hidden from the traditional software layers. This
hidden software layer will manage the distributed hardware resources and dynamically 
optimize the executing threads, based on observed inter-instruction communication and 
memory access patterns. 

In short, the microarchitecture focus will shift from Instruction Level Parallelism 
to Instruction Level Distributed Processing (ILDP) where emphasis will be on inter- 
instruction communication with dynamic optimization and a high level of interaction 
between hardware and software. In the following sections we expand on these ideas. 

1.1 Where We Have Been 

Fig. 1 is a simplified view of processor performance over the past three decades - be- 
ginning at the time when the CDC 7600 was the fastest available computer. At the left 
endpoint of the timeline, the 7600 was designed to 40 MHz, but shipped at a slightly 
slower rate, and was capable of sustaining about 0.25 instructions per cycle on real
applications. Main memory was relatively fast compared with the processor speed, and there
were no caches except for a small, loop-capturing instruction buffer. The compiler was 
capable of software pipelining (limited by the small register set). 

The right endpoint of the timeline is today’s superscalar processor that runs at about 1 
GHz and is capable of sustaining perhaps 1.25 instructions per cycle on real applications 
(often less). Caches are heavily used because of the relatively slower main memory, and 
compilers are quite advanced. 

All told, in the past 30 years, the performance improvement has been about 125 
times, or 16-18 percent per year. Based on clock frequency, technology is responsible 
for a 25 times improvement, and the combination of microarchitecture and compilers
is responsible for about a five times improvement in the past 30 years.
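
The arithmetic behind these figures can be checked with a short sketch. It uses only the endpoint numbers quoted above (about 40 MHz and 0.25 instructions per cycle for the 7600, about 1 GHz and 1.25 instructions per cycle today); the function and variable names are illustrative and not part of the original analysis.

```python
def performance_gains(old_clock_hz, new_clock_hz, old_ipc, new_ipc, years):
    """Split an overall speedup into a clock part and an IPC part."""
    clock_gain = new_clock_hz / old_clock_hz   # attributed to technology
    ipc_gain = new_ipc / old_ipc               # microarchitecture plus compilers
    total_gain = clock_gain * ipc_gain
    # Compound annual growth rate that yields total_gain over `years`.
    annual_rate = total_gain ** (1.0 / years) - 1.0
    return clock_gain, ipc_gain, total_gain, annual_rate


clock_gain, ipc_gain, total_gain, annual_rate = performance_gains(
    old_clock_hz=40e6, new_clock_hz=1e9, old_ipc=0.25, new_ipc=1.25, years=30
)
print(f"clock: {clock_gain:.0f}x, IPC: {ipc_gain:.0f}x, total: {total_gain:.0f}x")
print(f"compound annual improvement: {annual_rate:.1%}")
```

The compound rate comes out at roughly 17.5 percent per year, which falls inside the 16-18 percent range given in the text.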

The MOS-based microprocessor performance curve has been much steeper, and is 
the one often cited when considering performance trends. However, as Fig. 1 shows, 
this is misleading when one considers the larger picture, and probably sets unattainable 
expectations. Recall that a microprocessor of 1980 had a clock frequency of about 5
MHz with no pipelining, contrasted with the highest performance bipolar processors of
the day which were deeply pipelined and ran in excess of 100 MHz.

Fig. 1. High-end processor performance since 1970.

Technology has provided faster transistors, as well as more gates available to the de- 
signer/architect. MOS technologies have added substantially to the number of available 
transistors and the rise of CMOS has gone hand-in-hand with the heavy emphasis on 
ILP for general purpose processing. 

A second technology trend is that main memories have become quite large, but 
relatively slow when compared with processors. To accommodate this shift in technolo- 
gies, microarchitects have developed very elaborate memory hierarchies, consisting of 
multi-level caches, and sophisticated prefetch and buffering schemes. These important 
developments do not explicitly appear in the five times microarchitecture/compiler con- 
tribution to overall performance, but they are crucial for mitigating major performance 
losses that would have resulted from the widening gap between processor and memory 
cycle times. 

Turning now to applications and languages, in the early 1970s (and earlier), much of 
the emphasis in high performance computing was on numerical applications. Languages 
were FORTRAN, and later, C. Individual users developed and compiled their programs
(often with great care, to improve performance and to deal with small memories). A
program was often compiled for the individual machine on which the program was to
be run, and forms of manual profiling were typically used as the code was developed. 
For large numerical problems containing high levels of inherent parallelism, vector 
instructions were often employed. 

Over the years, the emphasis in general purpose computing has shifted from float- 
ing point applications to integer commercial applications, but the emphasis is still on 
conventional languages like C. And microarchitecture researchers still assume a heavy 
emphasis on compiler optimization and profiling, although one often hears that, in
practice, profiling is seldom used, and unoptimized software is frequently shipped.



1.2 Where We Are Headed 

Now consider where hardware technology and software are heading. With regard to 
technology, the two most important issues facing microarchitects are on-chip wire delays 
and power consumption. 

In the case of local (short) wires, the problem is one of congestion. That is, for many 
complex, dense structures, transistor sizes do not determine area requirements, wiring 
does. With global (long) wires, delays will not scale as well as transistor delays, because 
wire aspect ratios will be more constrained than in the past and fringing capacitance 
becomes an important factor [1].

An interesting analysis of wire delay implications is given in [2] where reachable 
chip area, measured in terms of SRAM memory cells, is projected. Because of longer 
wire delays in the future, fewer bits of memory will be reachable in a single clock cycle 
than today. For example, in 35 nm technology, it is projected that the number of reachable
bits will be about half of what is reachable today. 

Of course, reachability is a problem with general logic as well. As logical structures 
become more complex (and take relatively larger area) global delays will increase simply 
because structures are farther apart, even if critical path delays are ignored. To put it 
another way, using simple logic will likely improve performance directly by reducing 
critical paths, but also indirectly by reducing area (and overall wire lengths). As is often 
the case, this is not a new observation - all the S. Cray designs benefited from this 
principle. 

Finally, global multi-stop buses, i.e. buses with several receivers (and possibly multi- 
ple drivers) will have very long delays because of both wire capacitance and the loading 
on the bus. This type of on-chip, intra-processor bus is likely to disappear in the future 
because of its very poor delay characteristics. 

With respect to power, dynamic power is related to voltage levels and transistor 
switching activity, i.e. dynamic power $\approx A V^2 f$, where A is a measure
of switching activity, V is the supply voltage and f is the clock frequency. Higher clock 
frequencies and transistor counts have caused dynamic power to become a very big issue 
in recent years. Because of the dependence on voltage level, the trend is toward lower 
power supply voltage levels. 
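
The quadratic dependence on supply voltage is what makes voltage reduction so attractive. The sketch below evaluates the proportionality for a few cases; all quantities are relative (unitless) and the specific scaling factors are illustrative assumptions, not data from the paper.

```python
def dynamic_power(activity, voltage, frequency):
    """Relative dynamic power, proportional to A * V^2 * f (arbitrary units)."""
    return activity * voltage ** 2 * frequency


baseline = dynamic_power(activity=1.0, voltage=1.0, frequency=1.0)

# Doubling the clock at the same voltage doubles dynamic power.
faster = dynamic_power(activity=1.0, voltage=1.0, frequency=2.0)

# Doubling the clock while lowering the supply voltage by 30 percent
# (an assumed, illustrative figure) brings power back near the baseline.
faster_low_v = dynamic_power(activity=1.0, voltage=0.7, frequency=2.0)

print(f"2x clock:            {faster / baseline:.2f}x power")
print(f"2x clock, 0.7x Vdd:  {faster_low_v / baseline:.2f}x power")
```

Under these assumed factors the voltage reduction almost exactly cancels the cost of the doubled clock, since 0.7 squared is about 0.49.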

The power problem is likely to get much worse, however. To maintain high switching 
speeds with reduced supply voltages, transistor threshold voltages are also becoming 
lower [3]. This causes transistors to become increasingly “leaky”; i.e. current will pass
from source to drain even when the transistor is not switching. In the future, the resulting 
static power consumption will likely become dominant. For example, consider Fig. 2 
adapted from [3]. From this figure, trends indicate that static (standby) power will become
more important than dynamic (active) power within the next few technology generations. 
There are relatively few solutions to the static power problem. One can selectively gate- 
off the power supply to unused parts of the processor, but this will be a higher-overhead 
process than clock-gating used for managing dynamic power and can lead to
difficult-to-manage transient currents. Alternatively, one can use fewer transistors,
or at least fewer fast (i.e. leaky) transistors.

Fig. 2. Trends in dynamic and static power, by technology generation (um); from [3].

In software, the trend is toward object oriented languages with dynamic linking. 
And as mentioned earlier, many application codes are not highly optimized anyway, 
and dynamic linking is not compatible with the big static compile model. Furthermore, 
with a variety of hardware platforms all supporting the same instruction set, it becomes 
difficult to optimize a single binary to be executed on all of them. 

Finally, an important trend that brings together microarchitecture and applications is 
the use of on-chip multithreading. Multithreading has a very long tradition, but primarily 
in very high end systems where there has not been a broad software base. For example, 
multiprocessing became an integral component of large IBM mainframes and of Cray
supercomputers in the early 1980s [4,5]. However, the widespread use of multithreading 
has been a chicken-and-egg problem that now appears to be near a solution. 

As a way of continuing the aggressive MOS performance curve, single chips now 
support multiple threads or will soon [6,7]. This support is either in the form of multiple 
processors [8] or wide superscalar hardware capable of supporting multiple threads si- 
multaneously [9,10]. In any case, the trend is toward widespread availability of hardware- 
supported on-chip multithreading, and general purpose software will no doubt take ad- 
vantage of its availability. Furthermore, additional motivation for on-chip multithread- 
ing is the increasing number of important applications characterized by many parallel, 
independent transactions. For example, many web server applications are becoming 
throughput oriented; general purpose computing is shifting toward multiple latency- 
limited threads. 



2 Instruction Level Distributed Processing 

As was pointed out earlier, computer architecture innovation has done more than exploit 
technology; it has also been used to accommodate technology shifts. In the case of
cache memories, architecture innovation was used to avoid tremendous slow downs - by 







J.E. Smith 



adapting to the shift in processor/RAM technologies. Considering the trends described 
above, we appear to be at a point where microarchitecture innovation will be driven 
by shifts in both technology and applications. The goal will be to maintain long term 
performance trends in the face of increasing on-chip wire delays, power consumption, 
and applications that are not compatible with big, static, highly optimized compilations. 

A microarchitecture paradigm which deals effectively with technology and applica- 
tion trends is Instruction Level Distributed Processing (ILDP). A processor following 
the ILDP paradigm consists of a number of distributed functional units, each fairly simple
with a very high frequency clock cycle (for example, Fig. 3). 




Fig. 3. An example ILDP microarchitecture consisting of an instruction fetch unit, integer proces- 
sors, floating point processors and cache processors. 



The presence of relatively long wire delays implies microarchitectures that explicitly 
account for inter-instruction and intra-instruction communication. As much as possible, 
communication should be localized to small units, while the overall structure should be 
organized for communication. Meanwhile, communication among units will be
point-to-point (no buses), and will be measured in full clock cycles. Partitioning the system
to accommodate these delays will be a significant part of the microarchitecture design 
effort. There may be relatively little low-level speculation (to keep the transistor counts 
low and the clock frequency high); determinism is inherently simpler than prediction 
and recovery. 

With high inter-unit communication delays, the number of instructions executed per 
cycle may level off or decrease when compared with today, but overall performance can 
be gained by running the smaller distributed processing elements at a much higher clock 
rate. The structure of the system and clock speeds have implications for global clock 
distribution. There will likely be multiple clock domains, possibly asynchronous from 
one another. 
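
The trade-off described here is simply that instruction throughput is the product of instructions per cycle and clock frequency. The following sketch makes that explicit; the IPC and clock values for the two design points are hypothetical, chosen only to illustrate that a lower IPC can still yield higher overall performance.

```python
def instruction_rate(ipc, clock_hz):
    """Instructions completed per second = IPC * clock frequency."""
    return ipc * clock_hz


# Hypothetical design points (assumed for illustration, not measured):
wide_superscalar = instruction_rate(ipc=1.25, clock_hz=1.0e9)
ildp_processor = instruction_rate(ipc=1.0, clock_hz=2.0e9)

speedup = ildp_processor / wide_superscalar
print(f"ILDP speedup despite lower IPC: {speedup:.2f}x")
```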

Currently, there is an increasing awareness that clock speed holds the key to increased
performance. For several years, the big push has been for ILP, and gains have been made, 
but they now appear to be diminishing, and it makes sense to push more in the direction 
of higher clock speeds. This idea is not new; the role of clock speed has long been the 
subject of debate among RISC proponents. This was certainly the Cray approach, and it 
is apparent in the evolution of Intel processors [11,12,13] (see Fig. 4). In contrast to
the Intel processors where the pipelines have been made extremely deep, the challenge 
will be to use simplicity to keep pipelines shallow, even with a very fast clock. 

Fig. 4. Evolution of Intel processor pipelines.



Turning to power considerations, the use of a very fast clock in an ILDP computer 
will by itself tend to increase dynamic power consumption. However, the very modular, 
distributed nature of the processor will permit better power management. With most 
units being replicated, resource usage and dynamic power consumption can be managed 
via clock gating. In particular, usage of computation resources can be monitored and 
subsets of replicated units can be used (or not) depending on computation requirements 
and priorities. 

For static power, the high frequency clock will use fast leaky transistors more ef- 
fectively. If transistors consume power even when they are idle, it is probably better to 
keep them busy with active work - which a fast clock will do. In addition, the replicated 
distributed units will make selective power gating easier to implement. Furthermore, 
some units may be just as effective if slower transistors are used, especially if multiple 
parallel copies of the unit are available to provide throughput. 

For supporting on-chip multithreading, an ILDP provides the interesting possibility 
of a hybrid between chip multiprocessors and SMT. In particular, the computation units 
can be partitioned among threads. That is, with simple replicated units, different subsets 
of units can be assigned to individual threads. As a whole, the processor is shared as in 
SMT, but any individual unit services only one thread at a time, as in a multiprocessor. 
The challenge will be the management of threads and resources in such a fine-grain 
distributed system. 

The following sections delve deeper into types of distributed microarchitectures.

2.1 Dependence-Based Microarchitecture

Clustered dependence-based architectures [14] are one important class of ILDP
processors. The 21264 [15] is a commercial example; a much earlier and little known
example was an uncompleted Cray-2 design [16]. In these microarchitectures, process- 
ing units are organized into clusters and dependent instructions are steered to the same 
cluster for processing. 

In the 21264 microarchitecture there are two clusters, with different instructions routed
to each at issue time. Results produced in one cluster require an additional clock cycle 
to be routed to the other. In the 21264, data dependences tend to steer communicating 
instructions to the same cluster. Although there is additional inter-cluster delay, a faster 
clock cycle compensates for the delay and leads to higher overall performance. 




Fig. 5. Alpha 21264 clustered microarchitecture. 



In general, a dependence based design may be divided into several clusters; cache 
processing can be separated from instruction processing; integer processing can be sep- 
arated from floating point, etc. (see Fig. 3). In a dependence-based design, dependent 
instructions are collected together, so instruction control logic within a cluster is likely 
to be simplified, because there is no need to look for independence if it is known not 
to exist. For example, if all the instructions assigned to a cluster are known to form 
a dependence chain (or nearly so), they can be issued in order from a FIFO, greatly 
simplifying issue control logic. 

The formation of dependent instructions can be done by the compiler, or at various
stages in the pipeline, the dispatch stage being a good possibility. Waiting until the issue 
stage as in the 21264 may reduce inter-unit communication slightly, but at the expense 
of more complex issue logic. 
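
A minimal sketch of dependence-based steering at dispatch time is given below. It is not the 21264's steering logic or any published algorithm; it assumes a simplified model in which each instruction names one destination register and a list of source registers, follows the first producer it finds, and otherwise starts a new chain on the least-loaded cluster. All names are hypothetical.

```python
from collections import deque


def steer(instructions, num_clusters=2):
    """Steer instructions (in program order) to per-cluster FIFOs.

    `instructions` is a list of (dest, sources) tuples.
    Returns (fifos, transfers), where `transfers` counts source operands
    that must cross from one cluster to another.
    """
    fifos = [deque() for _ in range(num_clusters)]
    producer_cluster = {}   # register name -> cluster that last wrote it
    transfers = 0
    for dest, sources in instructions:
        homes = [producer_cluster[s] for s in sources if s in producer_cluster]
        if homes:
            cluster = homes[0]          # follow the first dependence found
        else:
            # No known producer: start a new chain on the least-loaded cluster.
            cluster = min(range(num_clusters), key=lambda c: len(fifos[c]))
        transfers += sum(1 for h in homes if h != cluster)
        fifos[cluster].append((dest, sources))
        producer_cluster[dest] = cluster
    return fifos, transfers


program = [
    ("r1", []),             # chain A starts
    ("r2", []),             # chain B starts
    ("r3", ["r1"]),         # depends on chain A
    ("r4", ["r2"]),         # depends on chain B
    ("r5", ["r3"]),         # chain A continues
    ("r6", ["r5", "r4"]),   # joins both chains: one inter-cluster transfer
]
fifos, transfers = steer(program)
for i, fifo in enumerate(fifos):
    print(f"cluster {i}: {[dest for dest, _ in fifo]}")
print("inter-cluster transfers:", transfers)
```

Because each FIFO ends up holding a dependence chain (or nearly so), in-order issue from the FIFO head loses little parallelism in this model, and only the join instruction pays the inter-cluster delay.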

2.2 Heterogeneous ILDP 

Another model for an ILDP is a heterogeneous system where a simple core pipeline is 
surrounded by outlying helper engines (Fig. 6). These helper engines are not in the critical
processing path, so they have non-critical communication delays with respect to the main 
pipeline, and may even use slower transistors to reduce static power consumption. 

Examples of helper engines include the pre-load engine of Roth and Sohi [17] where 
pointer chasing can be performed by a special processing unit. Another is the branch 
engine of Reinman et al. [18]. An even more advanced helper engine is the instruction 
co-processor described by Chou and Shen [19]. Helper engines have also been proposed 
for garbage collection [20] and correctness checking [21].




Fig. 6. A heterogeneous ILDP chip architecture. 



3 Managing ILDP: Co-designed Virtual Machines 

It seems clear that an ILDP computer will need some type of higher level management 
of the distributed resources used by executing instructions. This management involves 
the interactions among instructions making up a program, e.g. control and data depen- 
dences encoded in the instructions, and the interactions between the instructions and the 
computing/communication resources. 

Determination of instruction interactions can be done by compiler-level software. 
At compile time, inter-instruction dependences and communication can be determined 
(or predicted), then this information can be encoded into machine level instructions. At 
runtime this information can be used to steer instruction control and data information 
through the distributed processing elements. 

Alternatively, hardware can be used to determine the necessary inter-instruction
attributes by using hardware tables to collect dynamic history information as programs 
are executed. Then, this history information can be accessed by later instructions for 
steering of control and data information. 

Availability and usage of processor resources is another important consideration; 
for example, resource load balancing will likely be needed for good performance
- at both the instruction level and thread level. For power efficiency, gating
off unused or unneeded resources requires usage analysis and coordination, especially 
if power gating is widespread across a chip. Implementation of fault tolerance through 
replicated processing units also requires higher level management. This function can 
potentially be done via hardware or software, implemented as part of the OS. 

While they are viable solutions, a big disadvantage of software approaches based 
on conventional OS and compilers is that they likely require re-compilation and OS 
changes to fit each particular ILDP hardware platform. Programs today must be
recompiled to get maximum performance from the latest superscalar implementation, but they
often get reasonable performance without recompilation (or even optimization in some 
cases). ILDP microarchitectures, however, may not be as forgiving as a homogeneous 
superscalar processor, although this remains to be seen. Disadvantages of the hardware- 
intensive solution are complex, power-consuming hardware and a rather limited scope 
for collecting information regarding the executing instruction stream. 

An important alternative to using conventional software or hardware solutions is 
provided by currently evolving dynamic optimizing software and virtual machine tech- 
nologies. A co-designed virtual machine is a combination of hardware and software 
that implements a virtual architecture (VA) [22,23,24]. Part of the implementation is in
hardware - which supports an instruction set with implementation-specific features (the 
Implementation Architecture or IA). The other part of the implementation is in software, 
which translates the VA to the IA and which provides the capability of dynamically op- 
timizing a program. A co-designed VM should be viewed as a way of giving hardware 
implementors a layer of software with which to work. This software layer provides
flexibility in managing the resources that make up an ILDP microarchitecture. It also liberates
the hardware designer from supporting a legacy VA purely in hardware. 

With ILDP, the distributed processor resources must be managed with a global view. 
For example, instructions and data must be routed in such a way that resource usage is 
balanced and communication delays (among dependent instructions) are minimized, as 
with any distributed system. This would require high complexity hardware, if hardware 
alone were given responsibility. For example, the hardware would have to be aware of 
elements of program structure, such as data and control flow. However, conventional 
issue-window-based methods give hardware only a restricted view of the program, and 
the hardware would have to re-construct program structure information by viewing a 
small part of the instruction stream as it flows by. 

Hence, in the co-designed VM paradigm (Fig. 7), software is responsible for de- 
termining program structure, dynamically re-optimizing code, and making complex 
decisions regarding management of ILDP. Hardware implements the lower level per- 
formance features that are managed by software. Hardware also can collect dynamic 
performance information and may trigger software when unexpected conditions occur. 

Besides performance, the VM layer can also be used for managing resources to reduce
power requirements [25] and to implement fault tolerance. 




Fig. 7. Supporting an instruction level distributed processor with a co-designed Virtual Machine. 



4 The Role of Instruction Sets 

Historically, instruction sets have tended to evolve in discrete steps. After the original 
mainframe computers, instruction sets more-or-less stabilized until the early 1970s when 
minicomputers came on the scene. These machines used relatively inexpensive pack- 
aging and interconnections, and provided an opportunity to re-think instruction sets. 
Based on lessons learned from the relatively irregular mainframe instruction sets,
regularity and “orthogonality” [26] became the goals. While incorporating these properties,
minicomputer ISAs typically supported relatively powerful, variable-length instructions;
the PDP-11 and later VAX-11 instruction sets are good examples. As microprocessors
evolved toward general purpose computing platforms, there was another re-thinking of 
instruction sets, this time with hardware simplicity as a goal. These RISC instruction 
sets, among other things, allowed a full pipelined processor implementation to fit on 
a single chip. As transistor densities have increased, we have reached the point where 
older, more complex microprocessor instruction sets can now be dynamically translated 
on-chip into RISC-like operations. 

In retrospect, it seems that instruction set innovation has keyed off packaging/techno- 
logy changes, from mainframes to minicomputers to microprocessors. And now may be 
a good time to have another serious investigation of instruction sets. This time, however, 
on-chip communication delays, high speed clocks, and advances in translation/virtual 
machine software provide the motivation. 

Instruction sets can and should be optimized for ILDP. Features of new instruction 
sets should focus on communication and dependence, with emphasis on small, fast mem- 
ory structures, including caches and registers. For example, variable length instructions 
lead to smaller instruction footprints and smaller caches. While legacy binaries have 
inhibited new instruction sets, virtual machine technology and binary translation enable 
new implementation-level instruction sets. 

Most recent instruction sets, including RISC instruction sets, and especially VLIW
instruction sets, have emphasized computation and independence. The view was that 
higher parallelism could be achieved by focusing on computational aspects of instruction 
sets and on placing independent instructions in proximity either at compile time or during
execution time. For ILDP, however, instruction sets should be targeted at communication 
and dependence. That is, communication should be easily expressed and dependent 
instructions should be placed in proximity, to reduce communication delays. 

There are at least three types of related information that are important for an ILDP 
instruction set to express: i) instruction dependence information; ii) instruction steering 
information (and possibly data steering information) to guide instructions and data to the 
proper distributed processing elements so that dependent instructions can be executed in 
proximity with each other; iii) value usage information; in particular, if a value is used
only once, it does not have to be communicated beyond the consuming instruction. 

One possibility for conveying this type of information is to add tag bits to con- 
ventional instruction sets. These tag bits can explicitly indicate instruction dependence 
information, instruction steering information, and value usage information. This is
similar to the independence bits added to long instruction words in the Intel IA-64
instruction set [27].

It is also possible to use implicit methods of specifying the above information. For 
example, a stack-based instruction set places the focus on communication and depen- 
dence. Dependent instructions communicate via the stack top; hence, communication 
is naturally expressed. Dependent instructions tend to be clustered together, and local 
values appear on the stack top and then are immediately consumed. Furthermore,
stack-based ISAs tend to have small instruction footprints, which will lead to smaller (faster)
instruction caching structures.

As a more complete example, consider the following accumulator-based ISA. As- 
sume 64 general purpose registers and a single accumulator are used for performing 
operations. All operations must involve the accumulator, so dependent operations are 
explicitly apparent as is local value communication. With such an ISA, there is need for 
only one general purpose register field per instruction and the ISA can be made quite 
compact, with instructions 1, 2 or 4 bytes in length. For example, consider the following
basic instruction types.

R <- A               1 byte
A <- R               1 byte
A <- A op R          2 bytes
A <- A op imm        4 bytes
A <- M(R op imm)     4 bytes
M(R op imm) <- A     4 bytes
R <- M(A op imm)     4 bytes



The first two instructions copy data to/from a register and the accumulator; A is the single 
accumulator and R is one of the general purpose registers. The next two instructions are 
examples of operations on data held in the accumulator (and general register file). The 
last three instructions are example loads and stores. 

With an instruction set of this type, dependent instructions will naturally chain to- 
gether via the accumulator and will be contiguous in the instruction stream. With a 
clustered ILDP implementation, all the instructions in a dependent chain can be steered 
simultaneously to the same cluster, with the next dependent chain being steered to an- 
other cluster. If the accumulator is re-named within each cluster [16], the parallelism 
among dependence chains can be exploited, with global communication taking place
via the general registers. Because it contains only dependent instructions, the instruction 
issue queue in each cluster will be simplified as will local data communication through 
the accumulator. 
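
The chain-steering idea above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the mechanism of any particular implementation: the tuple encoding of instructions, the names `starts_chain`, `split_chains` and `steer`, and the round-robin steering policy are all hypothetical. A new dependence chain is assumed to begin whenever an instruction writes the accumulator without reading it (for example A <- R or A <- M(R op imm)).

```python
from typing import List, Tuple

# Hypothetical encoding: (destination, sources) for each instruction.
# "A" is the accumulator; every other name is a general register,
# an immediate, or memory ("M").
Instr = Tuple[str, Tuple[str, ...]]

def starts_chain(instr: Instr) -> bool:
    """Assumed rule: a chain starts when A is written without being read."""
    dest, srcs = instr
    return dest == "A" and "A" not in srcs

def split_chains(stream: List[Instr]) -> List[List[Instr]]:
    """Cut the instruction stream into contiguous accumulator chains."""
    chains: List[List[Instr]] = []
    for instr in stream:
        if starts_chain(instr) or not chains:
            chains.append([])
        chains[-1].append(instr)
    return chains

def steer(chains: List[List[Instr]], num_clusters: int) -> List[List[List[Instr]]]:
    """Send each whole chain to one cluster, round-robin (assumed policy)."""
    clusters: List[List[List[Instr]]] = [[] for _ in range(num_clusters)]
    for i, chain in enumerate(chains):
        clusters[i % num_clusters].append(chain)
    return clusters

program: List[Instr] = [
    ("A", ("R1",)),        # A <- R1            starts the first chain
    ("A", ("A", "R2")),    # A <- A op R2
    ("R3", ("A",)),        # R3 <- A            result leaves via a general register
    ("A", ("M", "R4")),    # A <- M(R4 op imm)  starts the second chain
    ("A", ("A", "imm")),   # A <- A op imm
    ("M", ("R5", "A")),    # M(R5 op imm) <- A
]
chains = split_chains(program)
clusters = steer(chains, num_clusters=2)
```

In this sketch the two chains land on different clusters, and the only value that crosses a chain boundary does so through a general register (R3), which matches the role the text assigns to the general registers for global communication.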

The stack and single accumulator ISAs are simple examples of instruction sets that 
implicitly specify communication/dependence information; whether the implicit or
explicit methods are better is not clear. The important point is that instruction sets
deserve renewed study, and future technologies and ILDP microarchitectures provide
fertile ground for innovation in this area. 



5 Summary 

Technology and application shifts are pointing toward instruction level distributed pro- 
cessing. These microarchitectures will contain distributed resources and will be explicitly 
structured for inter-unit communication. Helper processors may be distributed around 
the main processing elements to perform more complex optimizations and to perform 
highly parallel tasks. By constructing simple distributed processing elements, a very 
high clock rate can be achieved, probably with multiple clock domains. Replicated
distributed processing elements will also allow better power management.

Virtual machines fit very nicely in this environment. In effect, hardware designers 
can be given a layer of software that can be used to coordinate the distributed hardware 
resources and perform dynamic optimization from a higher level perspective than is 
available to hardware alone. Finally, it is once again time that we reconsider instruction 
sets with the focus on communication and dependence. New instruction sets are needed
to mesh with ILDP implementations, and they are enabled by the VM paradigm which 
makes legacy compatibility less important at the implementation architecture level. 
Abstract. Architects of future generation processors will have hundreds
of millions of transistors with which to build computing chips. At the 
same time, it is becoming clear that naive scaling of conventional (su- 
perscalar) designs will increase complexity and cost while not meeting 
performance goals. Consequently, many computer architects are advo- 
cating a shift in focus from high-performance to high-throughput with 
a corresponding shift to multithreaded architectures. Multithreaded ar- 
chitectures provide new opportunities for extracting parallelism from a 
single program via thread level speculation. We expect to see two ma- 
jor forms of thread-level speculation: control-driven and data-driven. We 
believe that future processors will not only be multithreaded, but will 
also support thread-level speculation, giving them the flexibility to oper- 
ate in either multiple-program/high-throughput or single-program/high- 
performance capacities. Deployment of such processors will require in- 
novations in means to convey multithreading information from software 
to hardware, algorithms for thread selection and management, as well as 
hardware structures to support the simultaneous execution of collections 
of speculative and non-speculative threads. 



1 Introduction 

The driving forces behind the tremendous improvement in processing speed have 
been semiconductor technology and innovative architectures and microarchitec- 
tures. Semiconductor technology has provided the “bricks and mortar” — in- 
creasingly greater numbers of increasingly faster on-chip devices. Innovations in 
computer architecture and microarchitecture (and accompanying software) have 
provided techniques to make good use of these building materials to yield high- 
performance computing systems. Computer designs are constantly changing as 
architects search for (and often find) innovations to match technology advances 
and important shifts in technology parameters; this is likely to continue well into 
the next decade. 

The process of deciding how available semiconductor resources will be used
can be decomposed into two parts. First, the architect must decide on the desired
functionality: the techniques used to expose, extract and enhance performance.
Then comes the problem of implementation: the techniques must be translated to
structures and signals which must themselves be designed, built and verified.



M. Valero, V.K. Prasanna, and S. Vajapeyam (Eds.): HiPC 2000, LNCS 1970, pp. 259- ^717] 2000. 
Though described separately, these issues are, in practice, very tightly coupled
in the overall design process. In the 1990s, novel functionality played the domi- 
nant role in microprocessor design. With a “reasonable” limit on the overall size 
of a design (e.g., fewer than tens of millions of transistors), the transistor budget 
could be divided by high-level performance metrics only. Verification was (rel- 
atively) simple and many of the problems encountered during implementation 
were manageable: wire delays were not significant as compared to logic delays, 
and power requirements were not exorbitant. In the future, however, implemen- 
tation issues are likely to dominate even basic functionality. Monolithic designs 
occupying many tens or hundreds of millions of transistors will be very diffi- 
cult to design, debug, and verify, and increasing wire delays will make intra-chip 
communication and clock distribution costly. These technology trends suggest 
designs that are made of replicated components, where each component may be 
as much as a complete processing element. Distributed, replicated organizations 
can “divide and conquer” the complexities of design, debug and verification, and 
can exploit localities of communication to deal with wire delays. 

Fortunately, the twin goals of increasing single-program performance and 
easing implementation are not in conflict. In fact, with the right model for paral- 
lelism they can be synergistic. Speculative multithreading is such a model, mak- 
ing it a leading candidate for implementation in future-generation processors. 
In speculative multithreading, a processor is (logically) comprised of replicated 
processing elements that cooperate on the parallel execution of a conventional 
sequential program (also referred to as a conventional program thread) that has 
been divided into chunks called speculative threads. Speculation is a key element. 
Without speculation, programs can only be divided conservatively into threads 
whose mutual independence must be guaranteed. Speculation allows these guar- 
antees to be bypassed, producing much more aggressive divisions into threads 
that are parallel with high probability. 

2 Rationale for Speculative Multithreading 

The motivation for using speculative multithreading comes from two directions. 
On one hand, the potential for further increasing single-program performance 
using known parallelism extraction techniques is diminishing. On the other, tech- 
nology trends suggest processors that can execute multiple threads of code. These 
circumstances invite us to find those few innovations that will enable such mul- 
tithreaded processors to support the parallel execution of a single program. 

2.1 Limitations of Existing Techniques to Extract Parallelism 

We begin by briefly reviewing the functionality and high-level operation of the 
incumbent model for achieving high single-program performance — the super- 
scalar model. Imperative programs — programs written in imperative languages 
like Fortran, C, and Java — are defined by a static control flow in which individ- 
ual instructions read and write named storage locations. At runtime, a super- 
scalar processor unrolls the static control flow to produce a dynamic instruction 
stream. The positions of reader and writer instructions in this stream define the
way data flows from one operation to another, i.e., the algorithm itself. A super-
scalar processor creates a dynamic instruction window (an unrolled contiguous
segment of the dynamic instruction stream), repeatedly searches this window for
un-executed, independent instructions, and attempts to execute these instruc-
tions in parallel. Sustained high performance demands that any given window
contain a sufficient number of independent instructions, i.e., a sufficient level of
instruction-level parallelism (ILP).
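
The window search described above can be illustrated with a small model. This is a minimal sketch, assuming only true (read-after-write) dependences matter, unit latency for every instruction, and a fixed issue width; the `(destination, sources)` encoding and the function name `schedule` are hypothetical and not taken from any real simulator.

```python
from typing import List, Tuple

# Hypothetical encoding: (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]

def schedule(window: List[Instr], width: int) -> List[List[int]]:
    """Return, cycle by cycle, the window positions that issue together.

    An instruction is treated as ready once every earlier instruction in
    the window that writes one of its sources has already executed.
    """
    done = set()
    cycles: List[List[int]] = []
    while len(done) < len(window):
        ready: List[int] = []
        for i, (_dest, srcs) in enumerate(window):
            if i in done:
                continue
            producers = [j for j in range(i) if window[j][0] in srcs]
            if all(j in done for j in producers):
                ready.append(i)
            if len(ready) == width:
                break
        done.update(ready)   # results become visible in the next cycle
        cycles.append(ready)
    return cycles

dependent_chain: List[Instr] = [
    ("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",)),
]
independent_ops: List[Instr] = [
    ("r1", ("a",)), ("r2", ("b",)), ("r3", ("c",)), ("r4", ("d",)),
]
chain_cycles = schedule(dependent_chain, width=4)
independent_cycles = schedule(independent_ops, width=4)
```

With the same issue width, the dependent chain needs one cycle per instruction while the independent instructions issue together, which is the sense in which a window must contain enough independent work to sustain performance.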

Unfortunately, the way in which imperative programs are written makes con- 
sistently high ILP a rarity. In order to preserve their sanity, programmers struc- 
ture programs in certain ways, a basic technique being the static (and hence 
dynamic) grouping of dependent instructions. The spatial proximity of related 
statements helps programmers reason about programs in a hierarchical fashion 
but limits the amount of independent work that would be available in a given win-
dow of dynamic instructions. Optimizing compilers attempt to improve the situ-
ation by transparently re-ordering instructions, mixing instructions from nearby
program regions to improve the overall levels of window ILP. However, while 
very sophisticated, compiler scheduling is fundamentally limited by compilers’ 
inability to perfectly determine the original intent of the programmer and their 
commitment to preserve the high-level structure of the original program. 

The amount of parallel work being what it is, one option is to build a su- 
perscalar processor with an instruction window large enough to simultaneously 
contain code from different program regions (i.e., different functions or loop
iterations). However, even if such a machine could be built (and there are many
engineering obstacles to doing so), there is a fundamental problem in keeping a
large, contiguous instruction window full of useful instructions. Specifically, the 
decreasing accuracy of a series of branch predictions leads to an exponentially 
decreasing likelihood that instructions at the tail of the window will be useful. 
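
The exponential decay mentioned above follows from a simple model. As a minimal sketch, assume each branch is predicted correctly with the same probability and that predictions are independent; both are simplifying assumptions, and the function names and the one-branch-every-few-instructions parameter are hypothetical. An instruction that sits behind k unresolved branches is then useful with probability accuracy**k (for instance, 0.9 squared is 0.81 after two branches).

```python
def useful_probability(accuracy: float, branches_ahead: int) -> float:
    """Probability that an instruction behind `branches_ahead` predicted
    branches is on the correct path, assuming independent predictions."""
    return accuracy ** branches_ahead

def tail_usefulness(accuracy: float, window_size: int, instrs_per_branch: int) -> float:
    """Usefulness of the last instruction in a contiguous window, assuming
    one branch every `instrs_per_branch` instructions (illustrative)."""
    branches = window_size // instrs_per_branch
    return useful_probability(accuracy, branches)

small_window = tail_usefulness(accuracy=0.95, window_size=64, instrs_per_branch=5)
large_window = tail_usefulness(accuracy=0.95, window_size=512, instrs_per_branch=5)
```

Under these assumptions, enlarging a single contiguous window mostly adds instructions whose chance of being useful is very small, which is the motivation the text gives for building a large window out of several smaller, more independent ones.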

Overcoming this problem requires a model that allows parallelism from dif- 
ferent program regions to be exploited in a reasonably independent (i.e., non- 
contiguous, non-serial) manner. Speculative multithreading is such a model. In 
speculative multithreading, each program region is considered to be a specu-
lative thread, i.e., a small program. By executing multiple speculative threads
in parallel, additional parallelism can be extracted (especially if each thread is 
mostly sequential). The threads are subsequently merged to recreate the origi- 
nal program. Speculative multithreading allows a large instruction window to be 
created as an ensemble of smaller instruction windows, thereby facilitating im- 
plementation. In addition, a proper thread division can logically isolate branches 
in one thread from those in another, relieving the fundamental problem of
diminishing instruction utility. 

2.2 The Emergence of Multithreaded Architectures 

Multithreaded processors — processors that support the concurrent execution of 
multiple threads on a single chip — are beginning to look as if they will dom- 
inate the landscape of the next decade. Two multithreaded processor models 
are currently being explored. Simultaneous multithreading (SMT)
uses a monolithic design with most resources shared amongst the threads. Chip
multiprocessing (CMP) proposes a distributed design (a collection of in-
dependent processing elements) with less resource sharing. The SMT model is
motivated by the observation that support for multiple threads can be provided 
on top of a conventional ILP (i.e., superscalar) processor with little additional 
cost. The CMP model is more conventionally motivated by design simplicity and 
replication arguments. Both models target independent threads (multithreaded
or multiprogrammed workloads) and use multithreading to improve processing
throughput. 

As technology changes, the distinction between the SMT and CMP microar- 
chitectures is likely to blur. Increasing wire delays will require decentralization 
of most critical processor functionality, while flexible resource allocation policies 
will enhance the appearance of (perhaps asymmetric) resource sharing. Regard- 
less of the specific implementation, multithreaded processors will logically appear 
to be collections of processing elements. The interesting question is whether this 
organization can be exploited to improve not only throughput but also the exe- 
cution time of a single program. Thread-level speculation is the key to enabling 
this synergy. In addition to executing conventional parallel threads, the logical 
processors could execute single programs that are divided into speculative threads. 
Speculative multithreaded processors will provide not only high throughput but 
also high single-program performance when needed. 

3 Dividing Programs into Multiple Threads 

There are several ways in which to divide programs into threads. We categorize 
these divisions as control-driven and data-driven depending on whether threads 
are divided primarily along control-flow or data-flow boundaries. Each division 
strategy can be further sub-categorized as either non-speculative — the threads 
are completely independent from the point of view of the processor and any 
dependence is explicitly enforced using architectural synchronization constructs, 
or speculative — the threads may not be perfectly independent, or synchronized, 
and it is up to the hardware to detect and potentially recover from violations of 
the independence assumptions. 

The threads obtained from a division of a program are expected to execute 
on different (logical) processing units. To achieve concurrency, proximal threads 
(i.e., threads that will simultaneously co-exist in the machine) need to be highly 
data-independent. If data-independence can be achieved, concurrency (and hence
performance) can scale almost linearly with the number of threads even for small 
per-thread window sizes, and efficiency can be kept constant as bandwidth and 
(hopefully) performance are increased. We expect that speculation can allow 
data-independence criteria to be achieved more easily, giving speculative solu- 
tions distinct performance and applicability advantages over their more conven- 
tional non-speculative counterparts. 

3.1 Control-Driven Threads

Although the object of multithreading a program is to divide it into data-
independent (parallel) threads, the most natural division of an imperative pro-
gram is along control-flow boundaries into control-driven threads. The archi-
tectural semantics of imperative programs are control-driven: instructions are
totally ordered and architectural state is precisely defined only at instruction 
boundaries. Control-flow is explicit while data-flow is implicit in the total or- 
der. In control-driven multithreading, the dynamic instruction stream is divided 
into contiguous segments that can subsequently be “sewn” together end-to-end 
to reconstruct the sequential execution. The challenge of control-driven multi- 
threading is finding division points that minimize inter-thread data dependences. 

We should note here that control-driven multithreading is not the same as 
parallel programming. Parallel programs do execute multiple concurrent control- 
driven threads, but these threads exchange data in arbitrary ways. The seman- 
tics of a parallel program is rarely the semantics of the individual threads run 
in series. In contrast, control-driven multithreading is a way of imposing par- 
allel execution on what is in essence a sequential program. Data flows between 
control-driven threads in one direction only, from sequentially “older” threads 
to “younger” ones. 

Non-speculative Control-Driven Threads. Without support for detecting 
and recovering from data-dependence violations or to abort unnecessary threads 
and discard their effects, non-speculative control-driven multithreading requires 
strict guarantees about the execution-certainty and data-integrity of threads.
Execution-certainty requirements stem from the fact that thread execution
cannot be undone, and mean that non-speculative control-driven threads can 
only be forked if their execution is known to be needed. In order to maximize 
concurrency, execution certainty is usually achieved by forking a thread at a 
previous control-equivalent point, e.g., forking of a loop iteration at the be- 
ginning of the previous iteration. Data-integrity refers to the requirement that 
access to thread-shared data must occur (or appear to occur) in sequential or-
der. When we speak of data-integrity, we are mainly concerned with memory-
integrity. Support for direct inter-thread register communication is typically not
available. We assume that if it is provided then appropriate synchronization 
is provided along with it. In contrast, inter-thread memory communication is
naturally available, meaning that access to any memory location that could 
potentially be shared with other threads must be explicitly synchronized. Of 
course, data-sharing/synchronization should be kept to a minimum to allow for
adequate concurrency among threads. 

With such strict safety requirements, the division of a program into non- 
speculative threads has traditionally fallen into the realm of the programmer 
and compiler. The programmer has the deepest knowledge of the parallel di-
mensions of the algorithm and the potential for data-sharing among different
divisions. However, performing thread division by hand is tedious, and manual 
attempts to minimize synchronization often lead to errors. In light of these dif-
ficulties, much effort has been placed into using the compiler to automatically
multithread (parallelize) programs. Although (debugged) compilers don’t make 
errors, and compiler tedium is less of an issue than programmer tedium, compiler 
multithreading has had success only in very limited domains. 



Speculative Control-Driven Threads. Non-speculative control-driven mul- 
tithreading suffers from two major problems. First, execution-certainty require- 
ments limit thread division to control-independent program points, which may 
not satisfy the primary data-independence criteria. Second, even when proximal 
threads are data-independent, if this independence is unprovable, then conser-
vative synchronization must be used to guard against the unlikely (but remotely
possible) case of a re-ordered communication. Where synchronization is need- 
lessly applied, concurrency and performance are unnecessarily lost. 

Speculation can alleviate these problems. In speculative control-driven mul- 
tithreading, memory does not need to be explicitly synchronized at all. The 
correct total order of memory operations can be reconstructed from the (ex- 
plicit or implied) order of the threads. This ordering can be used as the basis for 
hardware support to detect and potentially recover from inter-thread memory-
ordering violations. With such support, access to thread-shared data can
proceed optimistically, with penalties incurred only in those cases when data is 
actually shared by proximal threads and the accesses occur in non-sequential 
order. Furthermore, since ordering violation scenarios are typically predictable, 
slight modifications to the basic mechanism allow it to learn to recognize these 
scenarios early and artificially synchronize the offending store/load pairs.

The execution-certainty constraints can be lifted using similar mechanisms. 
The ability to recover from inter-thread memory-ordering violations implies the 
presence of hardware that can buffer or undo changes to architected thread 
state. This support can be used to undo an entire thread, allowing threads 
to be spawned at points at which their final usefulness cannot be absolutely 
guaranteed, but where usefulness likelihood is high and the data-independence 
(parallelism) characteristics are more favorable. 

Speculative control-driven multithreading has been the subject of academic 
research in the 1990s and is slowly finding its way into
commercial products. Sun’s MAJC architecture supports such threads via
its Space Time Computing (STC) model. More recently, NEC’s Merlot chip
uses speculative control-driven multithreading to parallelize the execution
of code that can’t be parallelized by other known means. We expect that more 
processors will make use of speculative control-driven threads in the coming 
decade, as this technology moves from the research phase into commercial im- 
plementations. 

3.2 Data-Driven Threads 

Where control-driven multithreading divides programs along control-flow bound-
aries, data-driven multithreading uses data-flow boundaries as the major divi-
sion criterion. Such a division naturally achieves the desired inter-thread data-
independence and resulting parallelism.

Data-driven threads are almost ideal from a performance and efficiency stand- 
point. In its pure form, data-driven multithreading occurs at the granularity of a 
single instruction. Data-driven sequencing (i.e., fetch) of an instruction is trig-
gered by the availability of one of its input operands. Instructions enter the
machine as soon as they may be able to execute but no sooner. This arrange- 
ment maximizes the amount of work that may be used to overlap with long 
latency instructions, while not wasting resources on instructions that are not 
ready to use them. 

Instruction-level data-driven sequencing is not the only option. Data-driven 
sequencing may be used on a thread granularity with conventional, control-driven 
sequencing used at the instruction level. In this organization, instruc-
tions from one or several related computations are packed into totally-ordered
threads that implicitly specify data-flow relationships. Individual threads are 
assigned to processing elements and sequenced and executed in a control-driven 
manner. However, the data-flow relationships between threads are represented 
explicitly and thread creation is triggered in a data-driven manner (i.e., by the 
availability of its data inputs from the outcome of a previous thread). The data- 
driven threads we expect to see in future processors are likely to be of this form. 
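
The thread-granularity organization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a model of any real machine: the class `DataDrivenThread`, the function `run`, and the named-value store are hypothetical. Each thread body is an ordinary sequential function (standing in for control-driven sequencing inside the thread), while thread start-up is triggered only by the availability of the thread's declared inputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class DataDrivenThread:
    name: str
    inputs: Tuple[str, ...]     # values that must exist before the thread starts
    output: str                 # value the thread produces when it finishes
    body: Callable[..., int]    # stands in for a control-driven instruction sequence

def run(threads: List[DataDrivenThread], initial: Dict[str, int]):
    """Fire each thread as soon as all of its inputs are available."""
    values = dict(initial)
    pending = list(threads)
    fired: List[str] = []
    progress = True
    while pending and progress:
        progress = False
        for thread in list(pending):
            if all(name in values for name in thread.inputs):
                args = [values[name] for name in thread.inputs]
                values[thread.output] = thread.body(*args)
                pending.remove(thread)
                fired.append(thread.name)
                progress = True
    return values, fired

# The threads are listed out of order on purpose: data availability,
# not listing order, decides when each one starts.
threads = [
    DataDrivenThread("t_sum", ("x", "y"), "s", lambda x, y: x + y),
    DataDrivenThread("t_x", ("a",), "x", lambda a: a + 1),
    DataDrivenThread("t_y", ("a",), "y", lambda a: a * 2),
]
values, fired = run(threads, {"a": 3})
```

Here `t_x` and `t_y` become eligible at the same time because both depend only on `a`, so they could run on different processing elements, while `t_sum` starts only after both results exist.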



Non-speculative Data-Driven Threads. Non-speculative data-driven mul- 
tithreading is difficult to implement for imperative languages. The main barrier 
is the incongruity of the requirement of an explicit data-flow program repre- 
sentation and the reality that for imperative programs, data-flow information is 
often impossible to explicitly specify a priori even as it applies to a few well-
defined boundaries. A data-forwarding error, either of omission or false commission,
changes the meaning of the program. The automatic conversion of imperative 
code to data-flow explicit form has been the subject of some research, but in 
general, data-driven program representations can only be constructed for code 
written in functional (data-driven) languages. 



Speculative Data-Driven Threads. Non-speculative data-driven multi- 
threading suffers from two major problems. First, programs can generally not be 
divided into data-driven threads. Second, even in cases where a division is possi- 
ble, the resulting representation breaks the sequential semantics created by the 
programmer and the correlation between the executing program and the source 
code from which it was derived. Sequential semantics (or at least their appear- 
ance) is very important for program development, debugging, and the interaction 
with non-data-driven system components and tasks. The loss of sequential se- 
mantics is more serious than simply being a disturbance to the programmer. 

Again, speculation is likely to be the key to solving these problems. How- 
ever, a shift in approach regarding the role of multithreading may be needed 
first. Two complementary observations guide this new approach. First, program 
development and debugging will probably require the presence of a “main” or 
“architectural” thread whose execution will implement the sequential, control-
driven semantics of the program. Second, programs inherently contain sufficient
levels of ILP, but this ILP is hindered by long-latency microarchitectural events 
like cache misses and branch mis-predictions. The parallelism in the program 
can be extracted if these latencies — which are likely to get relatively longer — 
can be tolerated. These observations suggest a different role for multithreading, 
one which does not require dividing the program per se. Instead, the program 
is augmented with “helper” threads that run ahead and pre-execute or “solve” 
problem instructions before they have a chance to cause stalls in the “main” 
program thread. We believe that it is in this capacity, as high-powered “helper” 
threads, that speculative data-driven threads can best be used in an imperative 
context.

In the “helper” model, selected computations are copied from the program 
and packed into data-driven threads. Now, the program is executed
as a single control-driven thread, as usual. However, at certain points in the 
main program, data-driven threads are spawned in order to pre-execute the 
computation of some future problem instruction. When the main program thread 
catches up to the data-driven thread, it has the option of picking up the result 
directly or simply repeating the work (albeit with a reduced latency).



The role of speculation is intermingled with the reduced “helper” status 
of data-driven threads. The fact that the control-driven thread is present and 
ultimately responsible for the architectural interface, immediately relieves data- 
driven threads from any correctness obligations. Without these obligations, data- 
driven threads can be constructed using whatever data-flow information is avail- 
able. In addition, they need not comprise a complete partitioning of the program; 
their use may be reserved only for those situations in which their parallelism- 
enhancing characteristics are most needed. 
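
A toy model makes the division of responsibility concrete. This is a minimal sketch, assuming a cache modelled as a set of addresses, illustrative hit and miss latencies, and a helper that completes before the main thread reaches the covered loads; the names `run_main_thread` and `run_helper_thread` and all constants are hypothetical. The main thread repeats every load itself (the "repeat the work with reduced latency" option), so the helper can only change timing, never the result.

```python
MISS_LATENCY = 100  # assumed cycles for a cache miss (illustrative)
HIT_LATENCY = 1     # assumed cycles for a cache hit (illustrative)

def run_main_thread(addresses, memory, cache):
    """Main (architectural) thread: performs every load itself, in order."""
    total, cycles = 0, 0
    for addr in addresses:
        cycles += HIT_LATENCY if addr in cache else MISS_LATENCY
        cache.add(addr)
        total += memory[addr]
    return total, cycles

def run_helper_thread(addresses, cache):
    """Helper thread: pre-executes only the address slice and warms the cache.
    It writes no architectural state, so it cannot affect correctness."""
    for addr in addresses:
        cache.add(addr)

memory = {addr: 2 * addr for addr in range(0, 64, 8)}
addresses = list(memory)

# Without helper threads, every load misses.
cold_total, cold_cycles = run_main_thread(addresses, memory, cache=set())

# With a helper that covers only every other load (a partial covering).
warm_cache = set()
run_helper_thread(addresses[::2], warm_cache)
warm_total, warm_cycles = run_main_thread(addresses, memory, warm_cache)
```

The helper in this sketch covers only half of the loads, reflecting the point that data-driven threads need not form a complete partitioning of the program; the computed result is identical in both runs and only the cycle count differs.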



4 Practical Aspects 

Whether future processors will also include support for speculative threads
(either control-driven, data-driven or both) depends on the discovery of ac-
ceptable solutions to several practical problems. These problems range from the
low-level (i.e., how threads should be implemented) to the high-level (i.e., how 
threads should be used) and cover all levels in between. We briefly touch upon 
some of these issues in this section. 



4.1 System Architecture 

The broadest decision that needs to be made and the one that will have the most 
impact on other decisions is the division of labor and responsibilities between the 
programmer, compiler, operating system and processor. It is obvious that the 
processor will execute the threads. However, the answer to the question of what
entity should be responsible for other thread-related tasks (from selecting the
threads themselves to spawning, scheduling, resource allocation and communi-
cation) is not clear. Placing all of the responsibility on the processor is one
attractive option. With near-future processors having nearly one billion tran- 
sistors, a few million can be dedicated to multithreading-specific management 
tasks (perhaps as a separate co-processor). A processor-only implementation has
no forward or backward compatibility problems; it preserves the current system
interface while enhancing the performance of legacy software. Its drawbacks are
added design complexity and the mandated rigidity and simplicity of the thread
selection and management algorithms.

Since thread-selection is such an important and delicate problem, it seems 
logical to push at least that function to software or perhaps even the programmer. 
Thread-selection algorithms implemented in software can be more sophisticated 
and may produce better thread divisions. Thread divisions chosen by the pro- 
grammer — who understands the program at its highest, algorithmic levels — 
and subsequently communicated to the compiler, may be better still. However, 
any path in which multithreading information flows from or through software to 
the hardware requires a change in the software/hardware interface. Such changes 
are typically met with some resistance, especially if they have architectural se- 
mantics that need to be implemented. 

Our expectation is that speculative thread information is likely to be con-
veyed from software to hardware, but in an advisory form. An example of advi-
sory information is the prefetch instruction found in many recent archi-
tectures. The understanding is that the hardware may act upon this informa-
tion either fully, partially or selectively, or even ignore it altogether, all without
impacting correctness. The option to enhance or refine this information dynam- 
ically is left to the processor as well. Restricting speculative thread information 
to an advisory role relieves the architect from many functionality guarantees 
that would hamper future generation implementations. 

4.2 Specific Hardware Support 

For the full power of speculative multithreading to be realized, hardware support 
is required. Specifically, threads need to be made “lightweight” with mechanisms 
for fast thread startup and inter-thread communication and synchronization. 
Hardware support for speculation includes buffering for speculative actions and 
facilities for fast correct-speculation state commit and, likewise, mis-speculation 
recovery. The precise support required for control-driven and data-driven threads 
is somewhat different. An additional challenge is to provide this support, as 
well as support for conventional parallel threads, using a uniform set of simple 
mechanisms. 

One apparent requirement for the implementation of lightweight threads 
(speculative and otherwise) is a mechanism for passing values from one thread 
to another via registers. Memory communication and synchronization is likely 
to be reasonably fast on a speculatively multithreaded processor, since the bulk 
of it will occur through the highest level of shared on-chip cache. However, a 
register path for communication and synchronization is likely to be faster still. 
Inter-thread register communication will also allow thread register contexts to
be initialized quickly, accelerating thread start-up. 

We assume that the register-communication mechanism will implement inter- 
thread register synchronization. Another requirement is a mechanism for enforc- 
ing correct ordering of memory operations from different threads. At a high 
level, such a mechanism would buffer loads from young threads and compare 
them with colliding stores from older threads. Designs for inter-thread memory 
ordering mechanisms are known in both centralized and distributed forms.
The distributed form uses a modified cache-coherence protocol that blends
naturally with the protocol that implements general data-sharing for parallel 
threads. We expect this form to find widespread use in future processors. 
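
At a high level, the load-buffering check described above can be sketched as a small centralized structure. This is a minimal, deliberately conservative sketch and not a description of any published design: the class `OrderingBuffer`, its method names, and the use of integer thread ids (smaller id means sequentially older) are assumptions. It also ignores stores from intermediate threads and a thread's own earlier stores, either of which a real mechanism would use to avoid needless squashes.

```python
from collections import defaultdict
from typing import Dict, List, Set

class OrderingBuffer:
    """Records speculative loads and flags younger threads that loaded an
    address before an older thread stored to it."""

    def __init__(self) -> None:
        # address -> ids of threads that have already loaded from it
        self._loads: Dict[int, Set[int]] = defaultdict(set)

    def load(self, thread_id: int, addr: int) -> None:
        """Buffer a load performed by `thread_id`."""
        self._loads[addr].add(thread_id)

    def store(self, thread_id: int, addr: int) -> List[int]:
        """Return the younger threads whose earlier loads collide with this store."""
        return sorted(t for t in self._loads[addr] if t > thread_id)

    def squash(self, thread_id: int) -> None:
        """Discard the buffered loads of a thread that is being restarted."""
        for readers in self._loads.values():
            readers.discard(thread_id)

buf = OrderingBuffer()
buf.load(thread_id=2, addr=0x40)              # young thread loads early
buf.load(thread_id=1, addr=0x80)
violators = buf.store(thread_id=0, addr=0x40)  # older thread stores later
in_order = buf.store(thread_id=3, addr=0x80)   # load was older than the store
for t in violators:
    buf.squash(t)
after_squash = buf.store(thread_id=0, addr=0x40)
```

Only the first store reports a violation, because only there did a sequentially younger thread read the location before an older thread wrote it; the penalty is paid only when data is actually shared out of order, as the text describes.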

5 Summary 

Future processors will comprise a collection of logical processing elements
that collectively execute multiple program threads. To overcome the limitations
in dividing a single program into multiple threads that can execute on
these logical processing elements, speculation will be used. A sequential
program will be “speculatively parallelized” and divided into speculative
threads. Speculative threads are not only a good match for the microarchitectures
that are likely to result as technology advances; they also have the potential to
overcome the limitations of currently known methods of extracting instruction-level
parallelism.

There are two main types of speculative threads that we expect to be used: 
control-driven and data-driven threads. Speculative control-driven threads have 
already begun to appear in commercial products (e.g., Sun’s MAJC and NEC’s 
Merlot), while speculative data-driven threads are still in the research phase. 

Several technologies will have to be developed before speculative multithreading
is commonplace in mainstream processors. These include means for conveying
thread information from software to hardware, algorithms for thread selection
and management, and hardware and software to support the simultaneous execution
of a collection of speculative and non-speculative threads. Consequently,
we expect the next decade of processor development to be at least as exciting
as previous decades.

Abstract. In this paper, we propose a Barrier Tree for Irregular Networks
(BTIN) and a barrier synchronization scheme using BTIN for
switch-based cluster systems. The synchronization latency of the proposed
BTIN scheme is asymptotically O(log n), while that of the fastest
scheme reported in the literature is bounded by O(n), where n is the
number of member nodes. An extensive simulation study shows that, for
a group size of 256, the BTIN scheme improves the synchronization
latency by a factor of 3.3 to 3.8, and is more scalable than conventional
schemes while generating less network traffic.

Index terms: Cluster systems, irregular networks, barrier synchroniza- 
tion, communication latency. 



1 Introduction 

Switch-based cluster systems have been widely accepted as cost-effective alternatives
for high-performance computers. Since computational nodes¹ or switches
may be added to or detached from the network dynamically, it is generally
assumed that the switches form an irregular topology. Distributed reconfiguration
algorithms identify the network topology before computation begins.
The irregular topology makes it difficult for routing to avoid deadlock among
multiple packets traveling simultaneously. For example, the up/down routing
algorithm prevents deadlock by restricting the sequences of turns in the routing
paths. Collective operations need more attention than point-to-point
operations since they often determine the execution time of the
sequential part of a parallel program, which usually constitutes the bottleneck.
Such collective operations, however, become more complicated for switch-based
cluster systems due to the irregularity mentioned above.

¹ Nodes, in this paper, actually mean PCs or workstations in a cluster system.

A barrier is a synchronization point in a parallel program at which all pro- 
cesses participating in the synchronization must arrive before any of them can 
proceed beyond the synchronization point. In general, barrier synchronization is 
split into two phases - reduction and distribution. During the reduction phase, 
each participating process notifies the root process of its arrival at the barrier 
point. Upon the notification from all member processes, the distribution phase 
begins and the root process notifies them that they can proceed further. While
software barriers inherently suffer from large communication latency, hardware-supported
barriers are usually an order of magnitude faster than software barriers.
For switch-based irregular networks, even though hardware-supported
multicast has been extensively studied, there has been, to the authors’
knowledge, no work devoted to hardware-supported barrier synchronization.

In this paper, we propose a Barrier Tree for Irregular Networks (BTIN)
for switch-based cluster systems of irregular topology, which significantly
reduces the synchronization latency and network traffic without deadlock. It is a
tree-based combining scheme that constructs a barrier tree and embeds it into
the corresponding switches by placing special registers in the switches. The
synchronization latency of the BTIN scheme is O(log n), while that of the fastest
scheme reported in the literature is bounded by O(n), where n is the number
of member nodes. An extensive simulation study shows that, for a group size of
256, the BTIN scheme improves the synchronization latency by a factor of 3.3
to 3.8, and is more scalable than conventional schemes while generating less
network traffic.

The rest of the paper is organized as follows. Switch-based cluster systems
of irregular topology, the construction of a BTIN, and the corresponding switch
operations are presented in the following section. Section 3 analyzes
the characteristics of the BTIN scheme, including tree height and deadlock
freedom. The performance of the proposed scheme is evaluated and discussed in
Section 4, and conclusions and future work are covered in Section 5.

2 Tree-Based Barrier Synchronization 

We first introduce switch-based cluster systems of irregular topology. Then, the 
proposed BTIN and the corresponding operations are presented. 

2.1 Switch-Based Cluster Systems 

An example of switch-based cluster systems with irregular topology is drawn in 
Fig. 1. Each switch has a set of ports and each port is connected to a computa- 
tional node or other switch. Some ports may be left open and can be used for 
further system expansion. In this example, the problem is to synchronize at a
barrier point the 14 member nodes of a process group (the dark circles in Fig. 1),
which have been selected for running a parallel application. Without loss of
generality, in this paper, we consider wormhole routing with input and output
buffers that are each one flit wide.

Hardware support for barrier synchronization consists of barrier registers within
the switches. One register is assigned to each barrier; a similar concept
has been assumed in the switch-based multicast approaches, where a processor
can access a register within a switch. We assume that a barrier register
can hold an entire synchronization message. This is justified by the fact
that a synchronization message is very short and fixed in length, since it need
not carry multiple destination addresses and the synchronization data is
quite small.

Collective operation primitives, including barrier synchronization, are included
in most message-passing libraries. Among them, we target our discussion at the MPI
(Message Passing Interface) standard. However, the algorithms presented
here can be applied to other message-passing systems with little modification.




Fig. 1. A switch-based cluster system with irregular topology.

Legend: □ switch, ○ nonmember node, ● member node.

Fig. 2. A BTIN constructed from the cluster system in Fig. 1.



2.2 Barrier Tree for Irregular Networks (BTIN) 

In this paper, we define a member switch as a switch with at least one member
node attached and a nonmember switch as one without any member node. A
representative member node is the member node attached to
a member switch via the lowest-numbered port. Given a root switch, a breadth-first
spanning (BFS) tree is constructed around the root switch by a distributed
algorithm. At group creation time, each representative member node runs
the algorithm to determine the BFS topology, and the nodes running the
algorithm eventually agree on a unique BFS tree. Once a BFS tree is found, the
algorithm checks whether there is any nonmember leaf switch in the tree. If such
a leaf switch exists, it is removed from the BFS tree. Then, each representative
member node sets up a barrier register in the corresponding switch to
embed the resulting BTIN into the network.

Fig. 2 shows a BTIN at the distribution phase, which is constructed from
the cluster system in Fig. 1 and contains the same 14 member nodes. The root
switch is the switch labeled 3, and the root node is the node (3, 4). Here
the notation (3, 4) denotes the node attached to port 4 of switch
3. In Fig. 2, the distribution message follows the arrows in accordance with the
BTIN routing algorithm. Note that nonmember switches are not included in
the tree unless they are intermediate switches in the tree.

Root Switch and Root Node 

The root switch of a BTIN must be a member switch and is chosen so that
the resulting BTIN has the minimum tree height among all possibilities. If two or
more BFS trees have the same minimum height, the one with the minimum number
of edges is chosen. If two or more of these also have the same number
of edges, the one with the minimum number of leaves is selected. Every
representative member node performs this search for the root switch.

In the example of Fig. 2, the root switch is the switch labeled 3, and the node
(3, 4) is selected as the root node since, of the two member nodes attached to
that switch, it is connected via the lower-numbered port (port 4). Note that, unlike
the up/down routing tree, edges between siblings are not permitted in a BTIN.
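
The selection rule above can be sketched in a few lines of Python. This is a centralized illustration under stated assumptions, not the distributed algorithm of the paper: the adjacency-dictionary representation, all function names, and the final tie-break by lowest switch identifier are choices made for this example.

```python
from collections import deque


def bfs_tree(adj, root):
    """Return {switch: parent} for a BFS tree, scanning neighbors in identifier order."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        s = queue.popleft()
        for t in sorted(adj[s]):
            if t not in parent:
                parent[t] = s
                queue.append(t)
    return parent


def prune(parent, members):
    """Repeatedly drop nonmember leaf switches until none remains."""
    parent = dict(parent)
    while True:
        inner = {p for p in parent.values() if p is not None}
        doomed = [s for s in parent if s not in inner and s not in members]
        if not doomed:
            return parent
        for s in doomed:
            del parent[s]


def tree_key(parent):
    """(height, edges, leaves): the tie-breaking order given in the text."""
    def depth(s):
        d = 0
        while parent[s] is not None:
            s, d = parent[s], d + 1
        return d
    inner = {p for p in parent.values() if p is not None}
    height = max(depth(s) for s in parent)
    return (height, len(parent) - 1, len(parent) - len(inner))


def choose_root(adj, members):
    """Pick the member switch whose pruned BFS tree minimizes (height, edges, leaves)."""
    return min(sorted(members),
               key=lambda r: tree_key(prune(bfs_tree(adj, r), members)))


# Hypothetical 6-switch network: a chain 0-1-2-3-4 with nonmember switch 5 hanging off 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
members = {0, 2, 4}          # switches with at least one member node attached
root = choose_root(adj, members)
print(root, tree_key(prune(bfs_tree(adj, root), members)))  # 2 (2, 4, 2)
```

Switch 5 is pruned because it is a nonmember leaf, while switches 1 and 3 stay in the tree as intermediate nonmember switches.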

Switch Setup 

Each barrier register contains a group identifier (GID), a parent port number
(P), parent and children bits, arrival bits for children, and synchronization data,
as shown in Fig. 3. Unlike the other fields, the arrival bits and the message field
are used when a synchronization message is processed rather than at the initial setup
time. For example, A_0 indicates that a reduction message has arrived from the
child switch connected to C_0 during the reduction phase.



[Figure: barrier register layout with the fields Group Id (GID), P, the C bits, the A bits, and Message.]

P: port number for the parent. C_i: parent or child on port i. A_i: arrived from C_i.

Fig. 3. Structure of a barrier register.



Below we describe the distributed algorithm that sets up the barrier registers,
which every representative member node runs at BTIN construction time.
A special operation is required to set up a barrier register in the intermediate
nonmember switches involved in the BTIN, as described in step 6.

SetupRegister(S, M, s_r, m_r, s_l, m_l, GID)

1. Let S = {s_0, s_1, ..., s_{q-1}} be all the switches, M = {m_0, m_1, ..., m_{n-1}}
be the addresses of the member nodes, s_r be the root switch, m_r be the root
node, s_l be the local switch, m_l be the local node, and GID be the group
identifier.

2. If the local node m_l is not the representative member node of the local switch
s_l, return.

3. Around the root switch s_r, establish the corresponding BFS tree, scanning
switches in order of switch identifier (address).

4. Remove nonmember leaf switches from the BFS tree until no
nonmember leaf switch remains, turning the tree into a BTIN.

5. Write the GID, the parent port number (P), and the parent and children
bits into a barrier register in the local switch s_l.

6. If there is any nonmember descendant switch that can be reached without
passing through intermediate member switches, request a node attached to that
nonmember descendant switch to set up a barrier register in the descendant
switch by transmitting a point-to-point message. The destination node then
writes the GID, the parent port number (P), and the parent and children
bits into a barrier register in the corresponding switch. Repeat step 6
until no such nonmember descendant switch remains.
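
The register contents written in step 5 can be illustrated with a small Python sketch that takes an already pruned tree as input. The `BarrierRegister` class, the port dictionaries, the use of -1 for the root's parent port, and the choice to set C bits for ports with attached member nodes are all assumptions of this example rather than details given in the paper.

```python
from dataclasses import dataclass


@dataclass
class BarrierRegister:
    gid: int            # group identifier (GID)
    parent_port: int    # P; -1 at the root switch (an assumption of this sketch)
    tree_ports: int     # C bits: bit i set if port i leads to the parent or a child
    arrived: int = 0    # A bits: filled in during the reduction phase
    message: int = 0    # synchronization data


def setup_registers(parent, ports, member_ports, gid):
    """Build one register per switch of a BTIN.

    parent:       {switch: parent switch, or None for the root}
    ports:        {(switch, neighbor switch): port number on `switch`}
    member_ports: {switch: list of ports with member nodes attached}
    """
    regs = {}
    for s, p in parent.items():
        mask = 0
        for port in member_ports.get(s, []):
            mask |= 1 << port                  # attached member nodes
        for t, tp in parent.items():
            if tp == s:                        # t is a child switch of s
                mask |= 1 << ports[(s, t)]
        if p is None:
            parent_port = -1
        else:
            parent_port = ports[(s, p)]
            mask |= 1 << parent_port
        regs[s] = BarrierRegister(gid, parent_port, mask)
    return regs


# Hypothetical two-switch tree: switch 3 is the root, switch 1 is its child.
parent = {3: None, 1: 3}
ports = {(3, 1): 0, (1, 3): 2}
member_ports = {3: [4, 5], 1: [0]}
regs = setup_registers(parent, ports, member_ports, gid=7)
print(bin(regs[3].tree_ports), regs[1].parent_port)  # 0b110001 2
```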



2.3 Barrier Synchronization Using BTIN 

Fig. 4 shows the format of the synchronization message, which contains the message
type, the group identifier, and a small amount of synchronization data. For the
example shown in Fig. 1, a synchronization message comprises at most two bytes: a 2-bit
message type, an 8-bit group identifier for at most 256 different groups, and at most
6 bits of synchronization data, if any.
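
A two-byte message of this kind can be packed with simple shifts, as the following Python sketch shows. The bit order (type in the top two bits, then the group identifier, then the data) and the type codes are assumptions made for illustration; the text specifies only the field widths.

```python
def pack_sync_message(msg_type, gid, data=0):
    """Pack a 2-bit type, 8-bit group id, and 6-bit data field into 16 bits."""
    if not (0 <= msg_type < 4 and 0 <= gid < 256 and 0 <= data < 64):
        raise ValueError("field out of range")
    return (msg_type << 14) | (gid << 6) | data


def unpack_sync_message(word):
    """Inverse of pack_sync_message: return (msg_type, gid, data)."""
    return (word >> 14) & 0x3, (word >> 6) & 0xFF, word & 0x3F


REDUCTION, DISTRIBUTION = 0, 1   # hypothetical type codes

word = pack_sync_message(DISTRIBUTION, gid=200, data=5)
print(word.to_bytes(2, "big").hex())   # 7205
print(unpack_sync_message(word))       # (1, 200, 5)
```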



[Figure: message fields, in order: Message type, Group id., Synchronization data.]

Fig. 4. Format of the barrier synchronization message.



BTIN routing, or collective routing, performs message merging and replication
in the reduction and distribution phases, respectively. Collective merging
and replication are carried out at the switches. During the reduction phase,
the reduction messages travel upward toward the root switch and are
combined at each branch switch. During the distribution phase,
the distribution messages travel downward toward all the leaf
switches and are replicated at each branch switch.
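
The two phases can be mimicked with a short recursive Python sketch over a tree of switches. It counts only switch-to-switch hops and ignores timing, ports, and the links between nodes and switches; the function name and the dictionary inputs are assumptions of this example.

```python
def barrier(parent, members_at):
    """Simulate one barrier on a BTIN.

    parent:     {switch: parent switch, or None for the root}
    members_at: {switch: number of member nodes attached}
    Returns (number of members released, switch-to-switch hops used).
    """
    children = {s: [] for s in parent}
    root = None
    for s, p in parent.items():
        if p is None:
            root = s
        else:
            children[p].append(s)

    hops = 0

    def reduce(s):
        """A switch sends one combined message upward once all children have reported."""
        nonlocal hops
        arrived = members_at.get(s, 0)
        for c in children[s]:
            arrived += reduce(c)
            hops += 1            # one merged message per tree edge
        return arrived

    def distribute(s):
        """The release message is replicated once per child at each branch switch."""
        nonlocal hops
        released = members_at.get(s, 0)
        for c in children[s]:
            hops += 1
            released += distribute(c)
        return released

    total = reduce(root)
    assert distribute(root) == total   # every member that arrived is released
    return total, hops


# Hypothetical 4-switch tree rooted at switch 3, with 10 member nodes in total.
result = barrier({3: None, 1: 3, 2: 3, 0: 1}, {3: 2, 1: 1, 2: 3, 0: 4})
print(result)  # (10, 6)
```

Each tree edge carries exactly one message per phase, so the hop count is twice the number of edges regardless of how many member nodes are attached.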

3 Characteristics of BTIN 

Characteristics of BTIN are analyzed in this section. Then, we compare them 
with those of conventional approaches. 

3.1 Tree Height of a BTIN 

We define the connectivity, or connection ratio, f of k-port switches as the ratio
of the average number of connected ports to k. Hence, fk is the average
number of ports in a switch that are connected to either other switches or
computational nodes.

Synchronization latency is linearly proportional to the height of the BTIN. We
therefore analyze the average tree height of a BTIN established on a randomly
built irregular network. Without loss of generality, we assume that all
possible network configurations with k-port switches of connectivity f are
equally likely. Theorem 1 gives the average height of
a BTIN under this assumption. (The complete proof is given in a separate reference.)

Theorem 1: For an irregular network with k-port switches of connectivity f,
the average height, h, of a BTIN is asymptotically given by $h = \log_{(fk - p/q - 1)} n$,
where n is the number of member nodes, p is the number of nodes, and q is the
number of switches.

According to Theorem 1, the average height of a BTIN is $O(\log_{(fk - p/q - 1)} n)$.
This can be rewritten as O(log n) because $fk - p/q - 1$ is a constant. Hence,
the associated synchronization latency has a time complexity of O(log n).
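
To get a feel for the magnitudes involved, the height expression of Theorem 1 can be evaluated numerically. The Python sketch below reads the base of the logarithm as fk - p/q - 1 (connected ports per switch, minus attached nodes per switch, minus the parent link); the function name is an assumption, and the parameter values are those of configuration (i) in Section 4.

```python
import math


def btin_height(n, k, f, p, q):
    """Average BTIN height from Theorem 1: logarithm of n to the base (f*k - p/q - 1)."""
    base = f * k - p / q - 1
    if base <= 1:
        raise ValueError("branching factor must exceed 1")
    return math.log(n) / math.log(base)


# 8-port switches, 75% connectivity, 256 nodes, 75 switches.
for n in (16, 64, 256):
    print(n, round(btin_height(n, k=8, f=0.75, p=256, q=75), 1))
```

Squaring the group size doubles the estimated height, which is the logarithmic growth that the O(log n) bound expresses.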

3.2 Deadlock Freedom 

In irregular networks as well as regular ones, a major issue with wormhole
routing is deadlock. When the path of a message is blocked, the message head
as well as the rest of the message are stopped where they are, holding the buffers 
and channels along the path. Deadlock could occur if these stoppages create a 
cyclic dependency. However, if the message size is small enough, deadlock could 
be easily avoided by holding the entire message in the switch. In our barrier 
synchronization scheme, the synchronization messages need not carry all the 
destination addresses, and thus the lengths are identical and very small. 

The basic technique for proving that a network is deadlock-free is to articulate
the dependences that can arise between channels as a result of message
movement, and to demonstrate that there is no cycle in the resulting channel
dependence graph. This implies that no traffic pattern can lead to deadlock,
where the traffic patterns include those arising in three cases: a single barrier
synchronization, multiple concurrent synchronizations, and a mixture of
synchronization messages and normal messages. A complete discussion
of how the messages arising in these three cases do not create a deadlock
situation is given in a separate reference.

3.3 Comparison of Characteristics 

In this subsection, we compare the characteristics of BTIN with those of two
conventional approaches: the method using point-to-point messages and the method
using switch-based multicast in the distribution phase. For simplicity, in this
paper, we call the two approaches the unicast scheme and the multicast scheme,
respectively. The comparisons are summarized in Table 1. As can be seen from
the table, the proposed BTIN possesses preferable characteristics for all the
factors studied, which results in significantly better performance.

Table 1. Characteristics of barrier synchronization schemes on irregular networks.

                           Unicast scheme            Multicast scheme          BTIN scheme
Initialization at group creation time
  Routing path/tree        Centralized at the        Centralized at the        Distributed at all
  construction             root                      root                      members
  Router setup             None                      At member nodes           At member nodes
                                                     (for distribution)
During a barrier operation
  Hardware support         None                      Distribution phase        Reduction and
                                                                               distribution phases
  Synchronization          Short (single             Long (multiple            Short (no destination
  message size             destination address)      destination addresses)    address)
  Number of start-ups      2n (n for reduction       n + 1 (n for reduction    2 (one for reduction
                           and n for distribution)   and one for               and one for
                                                     distribution)             distribution)
  Complexity of            O(n)                      O(n)                      O(log n)
  routing latency
Primary weakness           Repetitive 2n unicast     Repetitive n unicast      Hardware complexity
                           transfers (very slow)     transfers and hardware    at the router
                                                     for multicast

The unicast scheme requires n repetitive point-to-point message transfers
for each of the reduction and distribution phases, and thus its routing latency
has a complexity of O(2n), which simplifies to O(n). We assume that, in
the multicast scheme, a switch-based multicast tree is used for the distribution
phase at the hardware level. In the multicast scheme, it is easy to see that the
routing latency of the distribution phase has a complexity of O(log n). However,
the reduction phase requires n repetitive unicast message transfers, resulting
in a complexity of O(n). Hence, the complexity of the multicast scheme is
O(n + log n), which simplifies to O(n).
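
The start-up counts behind these complexities are easy to tabulate. The following Python sketch simply encodes the counts stated in Table 1; the function name and the scheme labels are assumptions of this example.

```python
def startups(scheme, n):
    """Number of communication start-ups per barrier for n member nodes, as listed in Table 1."""
    if scheme == "unicast":
        return 2 * n        # n for reduction, n for distribution
    if scheme == "multicast":
        return n + 1        # n for reduction, one multicast for distribution
    if scheme == "btin":
        return 2            # one for reduction, one for distribution
    raise ValueError(scheme)


for n in (16, 256):
    print(n, [startups(s, n) for s in ("unicast", "multicast", "btin")])
# 16 [32, 17, 2]
# 256 [512, 257, 2]
```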

4 Performance Evaluation 

For different system configurations, the performance of the BTIN scheme is 
evaluated and compared to that of the multicast scheme, which is the most 
recent and efficient approach, using simulation. 

4.1 Simulation Environment 

We evaluate the performance of the proposed tree-based barrier synchronization
scheme on two different system configurations: (i) 256 nodes and 75 switches and
(ii) 1024 nodes and 300 switches. We assume that the network is built from
8-port switches with 75% connectivity. The member nodes are picked
randomly, and all the members are assumed to arrive at the barrier at the same
time. Channel contention is not considered.

The most important performance metric of barrier synchronization is the
synchronization latency, which is the interval from the time when the barrier
synchronization is invoked until the time when all the member nodes finish
the distribution phase. As another performance measure in our simulation, we
also investigate the network traffic incurred by the barrier synchronization. This
is measured by the number of links (hops) traversed by the synchronization
messages during a barrier operation.

The default performance parameters are based on overhead-minimized
communication on advanced switches: a communication startup time (t_s) of
2 to 10 µsec, a link propagation delay (t_p) of 20 to 40 nsec, and a switch
(router) delay (t_r) of 300 to 500 nsec. The startup time includes the software
overheads for allocating buffers, copying messages, and initializing the router and DMA.
The router delay includes several steps of complicated operations and varies with
the routing algorithm, as Chien analyzed. We also assume that the network
interface delay is almost the same as the switch delay for our evaluation.
Since synchronization messages do not need any data flits, the communication
latency of a message transfer can be approximated by $t_s + d \cdot t_p + (d + 1) \cdot t_r$, where
d is the distance between the source and destination nodes in a communication.
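
As a worked example of this latency model, the Python sketch below evaluates t_s + d t_p + (d + 1) t_r with the lower-bound parameter values. It assumes that the (d + 1) factor counts the d - 1 switches on a d-link path plus the two network interfaces, which are stated to cost about as much as a switch; the function name is hypothetical.

```python
def message_latency_us(d, t_s=2.0, t_p=0.020, t_r=0.300):
    """Latency in microseconds of one message over d links: t_s + d*t_p + (d+1)*t_r."""
    return t_s + d * t_p + (d + 1) * t_r


# Three links: 2.0 + 3 * 0.020 + 4 * 0.300 microseconds.
print(round(message_latency_us(3), 3))  # 3.26
```

With these values the start-up time dominates short paths, which is consistent with the emphasis the comparison places on the number of start-ups per barrier.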



4.2 Simulation Results and Discussion 
Synchronization Latency 

Fig. 5 shows the synchronization latency, where t_s, t_p, and t_r are assumed to
be 2.0 µsec, 20.0 nsec, and 300.0 nsec, respectively. For each parameter
set, 100 simulation runs are executed and the results are averaged. In most
cases, very small variance is observed. The number of nodes and the
number of switches are shown in parentheses in the labels of Fig. 5, where the
connectivity f is 75%.

The synchronization latency of the BTIN scheme is significantly lower than
that of the multicast scheme. Observe from the figure that the synchronization
latency of the BTIN scheme is almost independent of the group size except for
very small groups. This is mainly because the tree height of a BTIN
is bounded by $\log_{(fk - p/q - 1)} q$ on an irregular network with k-port switches of
connectivity f, where p is the number of nodes and q is the number of switches.
In a system of 1024 nodes and 300 switches, the group size can be more than 256.
Although it is not shown here, we increased the group size up to 1024; the
synchronization latency then converges to 11.5 µsec and 120.5 µsec for the BTIN and
multicast schemes, respectively. As shown in the figure, for a group size of 256,
the BTIN scheme is faster than the multicast scheme by factors of 3.8 and 3.3
for the system of 256 nodes and 75 switches and the system of 1024 nodes and
300 switches, respectively. The figure indicates that the BTIN scheme
is more scalable than the multicast scheme.




Fig. 5. Synchronization latency. 



Fig. 6. Network traffic. 



Network Traffic 

As shown in Fig. 6, the network traffic of the BTIN scheme is significantly
lighter than that of the multicast scheme, and the reduction
grows as the network size increases. For instance, for a group size
of 256, the network traffic of the BTIN scheme is lighter than that of the
multicast scheme by factors of 46.8 and 88.6 for the system of 256 nodes and
75 switches and the system of 1024 nodes and 300 switches, respectively. As
the group size increases, the network traffic also increases for both
schemes because more nodes and switches of the network are involved in a barrier
synchronization. The proposed BTIN scheme is thus more scalable than the
multicast scheme in terms of network traffic as well. More simulation
results and discussion are given in a separate reference.



5 Conclusions 

In this paper, we proposed a fast tree-based barrier synchronization scheme for
switch-based irregular networks, which is, to the authors’ knowledge, the first
approach to hardware support for barrier synchronization on irregular networks.
The BTIN tree is at most (k − 1)-ary, and the complexity of the synchronization
latency is O(log n), while that of the fastest existing scheme, which uses
switch-based multicast in the distribution phase, is bounded by O(n), where n
is the number of member nodes. The proposed BTIN scheme has been analyzed
and compared with the conventional schemes. An extensive simulation study shows
that, for a group size of 256, the BTIN scheme improves the synchronization
latency by a factor of 3.3 to 3.8, and is more scalable than conventional schemes
while generating less network traffic.

We are currently investigating the application of the BTIN scheme to other
collective communications such as multicast or total exchange. It would also be
interesting to consider the BTIN scheme in dynamic environments caused
by load balancing and node/link failures.

Abstract: This paper presents the results from running five experiments with
the Chime Parallel Processing System. The Chime System is an implementation
of the CC++ programming language (parallel part) on a network of computers.
Chime offers ease of programming, shared memory, fault tolerance, load balancing,
and the ability to nest parallel computations. The system has performance
comparable with most parallel processing environments. The experiments
include a performance experiment (to measure Chime overhead), a load balancing
experiment (to show even balancing of work between slow and fast machines),
a fault tolerance experiment (to show the effects of multiple machine
failures), a recursion experiment (to show how programs can use nesting and
recursion), and a fine-grain experiment (to show the viability of executions with
fine-grain computations).



1. Introduction 

This paper describes a series of experiments to test the implementation, features and 
performance of a parallel processing system called Chime. The experiments include 
runs of various scientific applications. Chime is a system developed at Arizona State 
University [1,2] for running parallel processing applications on a Network Of Work- 
stations (the NOW approach). 

Chime is a full implementation of the parallel part of Compositional C++ (or 
CC++) [3], running on Windows NT. CC++ is a language developed at Caltech and is 
essentially two languages in one. It has two distinct subparts - a distributed pro- 
gramming language designed for NOW environments and a parallel programming 
language designed for shared memory multiprocessor environments. While shared 
memory multiprocessors are very good platforms for parallel processing, they are 
significantly costlier than systems composed of multiple separate computers. Parallel 
CC++ is an exceptionally good language, but it was not designed to run on NOWs.
Chime solves this problem by implementing parallel CC++ on the NOW architecture.

Chime uses a set of innovative techniques called “two-phase idempotent execution
strategy” [4], “distributed cactus stacks” [1], “eager scheduling” [4], and “dependency
preserving execution” [2], together with the well-known technique called “distributed
shared memory” [5], to implement parallel CC++. In addition, it has the extra features
of load balancing, fault tolerance, and high performance. Chime is the first (and, as of
this writing, the only) parallel processing system that provides the above features coupled
with nested parallelism, recursive parallelism, and synchronization (these are features
of parallel CC++). This paper describes the implementation of Chime in brief and
presents details of the experiments with Chime.

M. Valero, V.K. Prasanna, and S. Vajapeyam (Eds.): HiPC 2000, LNCS 1970, pp. 283-292, 2000
© Springer-Verlag Berlin Heidelberg 2000



2. Related Work 

Shared memory parallel processing in distributed systems is limited to a handful of
Distributed Shared Memory (DSM) systems that provide quite similar functions (Munin
[6], Midway [7], Quarks [8], TreadMarks [9]). DSM systems are categorized by
the type of memory consistency they provide. DSM systems do not provide a uniform
view of memory, i.e., some global memory is shared and some is not. In addition, the
parallel tasks execute in an isolated context, i.e., they do not have access to variables
defined in the parent's context. Furthermore, a parallel task cannot call a function that
has an embedded parallel step (nesting of parallelism is not allowed).

The Calypso system [4, 17, 18] adds fault tolerance and load balancing to the DSM
concept, but suffers from the lack of nesting and synchronization (except barrier
synchronization). Chime is an extension of Calypso and resolves these shortcomings.

A plethora of programming systems for NOW-based systems exist that use the
message-passing technique. Two well-known systems are PVM [10] and MPI [11].
Fault tolerance and load balancing have been addressed in the context of parallel
processing by many researchers; a few examples are Persistent Linda [12], MPVM [13],
Dynamic PVM [14], Piranha [15], and Dome [16]. The techniques used in most of
these systems are quite different from ours and often add significant overhead for
facilities such as fault tolerance.

Most working implementations are built for the Unix platform (including Linux). 
Some have Windows NT implementations, but they are buggy at best. In our experi- 
ence, we have not been able to make any non-trivial applications work correctly with 
these systems on the Windows platform. For this reason, we are unable to provide 
comparative performance tests. 



3. Chime Features 

Shared memory multiprocessors are the best platform for writing parallel programs,
from a programmer’s point of view. These platforms support a variety of parallel
processing languages (including CC++) that provide programmer-friendly constructs
for expressing shared data, parallelism, synchronization, and so on. However,
the cost and the lack of scalability and upgradability of shared memory multiprocessor
machines make them a less than perfect platform. Distributed Shared Memory (DSM)
has been promoted as the solution that makes a network of computers look like a
shared memory machine. This approach is supposedly more natural than the message
passing method used in PVM and MPI. However, most programmers find that this is not
the case. The shared memory in DSM systems does not have the same access and
sharing semantics as shared memory in shared memory multiprocessors. For example,
only a designated part of the process address space is shared, linguistic notions of
global and local variables do not work intuitively, parallel functions cannot be nested,
and so on.

As stated before, Chime provides a multiprocessor-like shared memory programming
model on a network of workstations, along with automatic fault tolerance and
load balancing. Some of the salient features of the Chime system are:

1. Complete implementation of the shared memory part of the CC++ language. 
Hence programming with Chime is easy, elegant and highly readable. 

2. Support for nested parallelism (i.e. nested barriers including recursion) and syn- 
chronization. For example, a parallel task can spawn more parallel tasks and tasks 
can synchronize amongst each other. 

3. Consistent memory model, i.e. the global memory is shared and all descendants 
share the local memory of a parent task (the descendants execute in parallel). 

4. Machines may join the computation at any point in time (speeding up the compu- 
tation) or leave or crash at any point (slowdowns will occur). 

5. Faster machines do more work than slower machines, and the load of the machines 
can be varied dynamically (load balancing). 

In fact, there is very little overhead associated with these features over the cost of
providing DSM. This is a documented feature (see section 6.1) that Chime shares
with its predecessor Calypso [4]. Chime runs on Windows NT, and the released version
can be downloaded from http://milan.eas.asu.edu.



3.1 Chime Programming Example 

Chime provides a programming interface that is identical to the parallel part of the
Compositional C++ (or CC++) language. Consider the following parallel CC++ program:

#include <iostream.h>
#include "chime.h"
#define N 1024

int GlobalArray[N];   // global, hence shared by all parallel tasks

void AssignArray(int from, int to) {
    if (from != to)
        par {   // the two recursive calls run in parallel
            AssignArray(from, (from + to) / 2);
            AssignArray((from + to) / 2 + 1, to);
        }
    else
        GlobalArray[from] = 0;
}

int main(int argc, char *argv[]) {
    AssignArray(0, N - 1);
}
The above program defines a global (shared) array called GlobalArray, containing
1024 integers. It then initializes the array using a recursive parallel function
called AssignArray. The AssignArray function uses a "par" statement, which
executes the statements within its scope in parallel, thus calling two
instances of AssignArray in parallel. Each instance calls two more instances, and this
recursion stops when 1024 leaf instances are running.
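The recursion pattern can be imitated outside CC++. The following Python sketch is an illustration only and is not Chime code: the helper `par` is a hypothetical stand-in for the CC++ par statement, implemented with ordinary threads, and N is reduced to 64 so that the number of threads stays modest.

```python
import threading

N = 64  # the CC++ example uses 1024; reduced here to limit the number of threads
global_array = [None] * N  # plays the role of the shared GlobalArray


def par(*branches):
    """Run every branch in its own thread and wait for all of them.

    Hypothetical stand-in for the CC++ `par` statement.
    """
    threads = [threading.Thread(target=branch) for branch in branches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def assign_array(lo, hi):
    """Split [lo, hi] in half recursively until one element is left, then zero it."""
    if lo != hi:
        mid = (lo + hi) // 2
        par(lambda: assign_array(lo, mid),
            lambda: assign_array(mid + 1, hi))
    else:
        global_array[lo] = 0


assign_array(0, N - 1)
```

With N elements the recursion produces N leaf calls and N - 1 par statements, which is why a system that supports nested parallelism is needed to run the original program as written.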



4. Chime Technologies 

The implementation of Chime borrows some techniques used in an earlier system
called Calypso and adds a number of newer mechanisms. The primary mechanisms
used in Chime are:

Eager Scheduling: In Chime, the number of parallel threads running a parallel appli- 
cation can change dynamically and is larger than the number of processors used. 
Each processor runs a “worker” process, and one designated processor runs the 
manager. A worker contacts the manager and picks up one thread and when it 
finishes, it requests the next thread. Threads are assigned from the pool of uncom- 
pleted jobs. This technology provides load balancing (faster workers do more 
work) and fault tolerance (failed workers do not tie up the system) using the same 
technique. 

Two Phase Idempotent Execution Strategy (TIES): Since there is the possibility of 
multiple workers running the same thread, the execution of each thread must be 
idempotent. The idempotence is achieved by coupling eager scheduling with an 
atomic memory update facility implemented by Calypso DSM (see below). 

Calypso DSM: This is a variant of the well-known RC-DSM (Release Consistent
Distributed Shared Memory) technique. RC-DSM is modified so that the return of
pages is postponed to the end of the thread; the manager buffers all the
returned pages, updates them in an atomic fashion, and then marks the thread as
completed. This ensures correct execution even when threads fail at arbitrary
points [17].

Dependency Preserving Execution: Threads can create threads, and threads can synchronize
with other threads. This can cause unmanageable problems when multiple
workers are executing the same thread or when threads fail after creating new
threads (or fail after reaching a synchronization point). Dependency Preserving Execution
solves this problem. Each time a thread creates nested threads, it informs the
manager of the new threads and the old thread mutates into a new thread itself.
Similarly, at synchronization points the manager is informed and a mutation step is
performed. A complete description of this mechanism is beyond the scope of
this paper and is given in [2].

Distributed Cactus Stack: A data structure that replicates the application stack
amongst all machines to ensure correct nesting of parallel threads and scoping of
variables local to functions [1].
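The interplay of eager scheduling and the atomic commit at the end of a thread can be shown with a small single-process simulation. This is a minimal sketch under stated assumptions, not the Chime or Calypso implementation: the `Manager` class, its method names, and the "least-assigned first" selection rule are all hypothetical, and memory pages are reduced to dictionary entries.

```python
class Manager:
    """Toy manager: hands out uncompleted tasks and commits results atomically."""

    def __init__(self, tasks):
        self.tasks = tasks                        # task id -> function returning {address: value}
        self.completed = set()
        self.assignments = {tid: 0 for tid in tasks}
        self.memory = {}                          # shared memory as seen by the manager

    def next_task(self):
        """Eager scheduling: any uncompleted task may be handed out again."""
        pending = [tid for tid in self.tasks if tid not in self.completed]
        if not pending:
            return None
        tid = min(pending, key=lambda t: self.assignments[t])  # least-assigned first
        self.assignments[tid] += 1
        return tid

    def commit(self, tid, updates):
        """Apply a task's buffered updates exactly once; duplicates are discarded."""
        if tid in self.completed:
            return False
        self.memory.update(updates)
        self.completed.add(tid)
        return True


def run_worker(manager, crash_after=None):
    """Pull tasks until none remain; optionally 'crash' before a commit."""
    finished = 0
    while True:
        tid = manager.next_task()
        if tid is None:
            return
        updates = manager.tasks[tid]()            # updates are buffered locally
        if crash_after is not None and finished == crash_after:
            return                                # crash: the buffered updates are lost
        manager.commit(tid, updates)
        finished += 1


tasks = {i: (lambda i=i: {i: i * i}) for i in range(8)}
manager = Manager(tasks)
run_worker(manager, crash_after=2)   # this worker dies while holding its third task
run_worker(manager)                  # a surviving worker completes the remaining tasks
```

Because the crashed worker never committed, its task simply stays in the pool and is handed out again; because commits happen only once per task, running a task twice cannot corrupt the shared state.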
5. Experiments with Chime

We now describe a set of five experiments that use various scientific applications to
determine Chime's performance and behavior across a range of features. The five experiments
are the performance experiment, the load balancing experiment, the
fault tolerance experiment, the recursion experiment, and the fine grain execution
experiment. The experiments were conducted at different points in time, at different
locations, and by different people; hence not all the equipment used is the same (except that
Intel machines running Windows NT 4.0, connected by 100 Mbps Ethernet, were used for
all experiments). The systems used are stated along with the discussion of each
experiment.



5.1 Performance Experiment 

The performance experiments used several matrix multiply and ray-tracing programs,
and both yielded similar results. We show the results of a ray-tracing program below,
using Pentium Pro 200 processors. The first step is to write the program in sequential
C++ and measure its execution time. Then a parallel version is written in CC++ and run
with Chime on a variable number of processors (from 1 to 5), and the speedups are
calculated with respect to the sequential program. The results are shown in Figure 1.

Note that the single-processor execution under Chime is about 9% slower than the
sequential program and that the execution speed scales with the addition of processors;
the degradation in performance is at most 21%. This makes Chime competitive with most
parallel processing systems for NOWs, even though Chime has significantly better features.
This experiment shows that the overhead of Chime is quite small, in spite of its rich set
of features. We have been unable to run (in spite of extensive attempts) any complicated
programs with Windows NT implementations of systems such as PVM and MPI, and hence
cannot provide comparative performance numbers.



Fig. 1. Performance Experiment (plotted series: Time, Ideal, Speedup)



5.2 Load Balancing Experiment 

The load balancing experiment involves the same ray-tracing program as above, but
uses machines of different speeds to run the parallel application. In many parallel
processing systems, the slowest machine(s) dictate performance; that is, fast machines
are held up while the slow machines finish the work allocated to them. Chime does it
differently. The application was executed on 4 slow machines (P-133, 64MB), and then a
fast machine (P-200, 64MB) replaced one slow machine. This caused an increase in
speed. Replacing more slow machines with fast machines continued the trend.

We calculate an "ideal" speedup of the program as follows. Suppose a computation
runs for $T$ seconds on $n$ machines, $M_1, M_2, \ldots, M_n$. Machine $M_i$ has a
performance index of $P_i$ and is available for $t_i$ seconds. (The performance index is the
relative speed of a machine normalized to some reference.) Then the maximum
theoretical speedup that may be achieved is:

Ideal speedup = number of equivalent machines = $\frac{1}{T} \sum_{i=1}^{n} P_i \, t_i$



Fig. 2. Load Balancing Experiment (plotted series: Time, Ideal, Speedup; axis label as extracted: Number of Transients)



The performance index of a P-133 was set to 1 and the P-200 was measured to be 
1.96. Note that the load balancing experiment shows that the actual speedup is close 
to the ideal speedup (within 22%) and the load is balanced well among slow and fast 
machines. 
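The ideal speedup can be worked through numerically. The sketch below assumes that the number of equivalent machines is $\frac{1}{T} \sum_{i} P_i t_i$, which is the reading suggested by the definitions of $T$, $P_i$, and $t_i$; the function name and the run length of 600 seconds are hypothetical, while the performance indices 1 and 1.96 are the ones quoted in the text.

```python
def ideal_speedup(machines, total_time):
    """Number of equivalent reference machines over a run of total_time seconds.

    machines: iterable of (performance_index, seconds_available) pairs.
    """
    return sum(p * t for p, t in machines) / total_time


T = 600.0  # hypothetical run length in seconds

# Three P-133 machines (index 1) and one P-200 (index 1.96), all present for the whole run.
mixed = [(1.0, T)] * 3 + [(1.96, T)]
mixed_ideal = ideal_speedup(mixed, T)

# One machine present throughout plus one that is available for only 120 seconds.
partial = [(1.0, T), (1.0, 120.0)]
partial_ideal = ideal_speedup(partial, T)
```

Under these assumptions the mixed configuration is worth about 4.96 reference machines, and a machine that is present for one fifth of the run adds 0.2 of a machine.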



5.3 Fault Tolerance Experiment 

The “fault tolerance” experiment, using the ray tracing program, shows the ability of 
Chime to dynamically handle failures as well as allowing new machines to join the 
computation. For this test up to four P-200 machines were used. Of these machines, 
one was a stable machine, and the rest were transient machines. The transient ma- 
chines worked as follows: 

Transient machine: After 120 seconds into the computation, the transient
machine joins the computation and then, after another 120 seconds,
fails (without warning, i.e. crashes).

Figure 3 shows the effect of transient machines. The actual speedups and ideal 
speedups were computed according to the formula described earlier. Note that the 
ideal speedup measure takes into account the full power of the transient machines 
during the time they are up whether they are useful to the computation or not. The ex- 
perimental results show that the transient machines do contribute to the computation. 
Note that the transient machines end their participation by crashing. Hence whatever 
they were running at that point is lost. Such crashes do not affect the correct 
execution of the program under Chime. 
In fact, this experiment shows the real power of Chime. The system handles load
balancing and fault tolerance with no additional overhead. The real speedups are
close to the ideal speedups. In cases where machines come and go, the fault tolerance
features of Chime actually provide more performance than an equivalent,
non-fault-tolerant system.



5.4 Nesting and Recursion Experiment 



The nesting and recursion experiment is a test of Chime’s ability to handle nested
parallelism, especially when the program is recursive. We use a variant of the
Fast Fourier Transform (FFT) algorithm for this experiment. This variant is called the
Iterative FFT and has significant computational complexity and hence scope for
parallelization. To calculate the Fourier transform of an N-vector, we first compute
the Fourier transforms of the two halves of the vector and then combine the results. The exact
details of the FFT algorithm and its complexity analysis are omitted due to space constraints.

This recursive program is written as a subroutine called ComputeFFT( ).
The subroutine accepts the input vector and its size, splits the vector into two
parts, and calls itself recursively on each part. Each invocation of the recursive
call runs in parallel. Thus the program starts as one thread, splits into two, then
into four, and so on, until the leaf nodes compute the FFT of 2 elements. For a
data size of 32K (2^15) elements, the recursion tree is 15 deep and the number of logical
threads generated is about 32,000.

The following pseudo code illustrates the core of the parallel algorithm. 

ComputeFFT(n, vector[size 2^n]) {
    if (n == 1) compute the FFT
    else {
        Divide vector into odd and even halves;
        par {   // ** parallel step **
            ComputeFFT(n-1, odd-half-of-vector);
            ComputeFFT(n-1, even-half-of-vector);
        }
    }
    assimilate results;
    return results to caller;
}
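For readers who want to run the recursion, the pseudocode can be mirrored by a sequential radix-2 FFT. The Python sketch below is an illustration rather than the program used in the experiment: it recurses down to single elements instead of stopping at two, it assumes the input length is a power of two, and the two recursive calls that Chime would place in a par block are simply executed one after the other.

```python
import cmath


def compute_fft(x):
    """Recursive radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # In CC++ these two calls would sit inside a par block and run in parallel.
    even = compute_fft(x[0::2])
    odd = compute_fft(x[1::2])
    # "Assimilate results": combine the two half-size transforms.
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


def naive_dft(x):
    """Direct O(n^2) transform, used only to check the recursive version."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


data = [1, 2, 3, 4, 0, -1, -2, 5]
spectrum = compute_fft(data)
```

The two recursive calls touch disjoint halves of the input and only meet in the combine step, which is what makes the par block safe.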
Fig. 3. Fault Tolerance Experiment

Input Data Size    Execution Time (seconds)    Percent speedup
                   1 node       3 nodes        on three nodes
2^15               17           6              183%
2^16               30           8              275%
2^17               57           13             338%
2^18               108          20             440%



Note that the resulting parallel program is easy to write, readable, and very elegant.
This is one of the main appeals of Chime. The above program is run on data sets
ranging in size from 32K (2^15) to 256K (2^18) elements. The execution environment
was three IBM IntelliStation machines with Pentium-III 400 processors and
128MB of memory. The results are summarized in the accompanying table.

This experiment shows the ability of Chime to handle nesting and recursion, a feature
that makes writing parallel programs simple and is not available on any other NOW platform.
Note the superlinear speedup for the last two executions. This is due to the
availability of large memory buffers when using three machines. While one-machine
executions do page swapping, the buffering scheme built into Chime allows three
machines to buffer the data and avoid having to use the paging disk. This causes a
better than expected speedup on some applications.
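The "percent speedup" figures reported for this experiment appear to follow the definition $(T_1 / T_3 - 1) \times 100$, where $T_1$ and $T_3$ are the one-node and three-node execution times. This definition is inferred from the reported numbers and is not stated in the text; the function name below is hypothetical. A short Python check:

```python
def percent_speedup(t_one_node, t_three_nodes):
    """Percentage by which the three-node run is faster than the one-node run."""
    return round((t_one_node / t_three_nodes - 1) * 100)


# (one-node seconds, three-node seconds) for each input size, as reported.
fft_runs = {"2^15": (17, 6), "2^16": (30, 8), "2^17": (57, 13), "2^18": (108, 20)}
fft_percent = [percent_speedup(t1, t3) for t1, t3 in fft_runs.values()]
```

Under this definition a run that takes the same time on one and three nodes scores 0%, and a run that is exactly three times faster scores 200%.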



5.5 Fine Grain Execution Test 

The fine grain test was run using two scientific applications, LU decomposition and
computing eigenvalues. We present the results from the LU decomposition program.
LU decomposition transforms a matrix in order to solve a set of linear
equations. The transform yields a matrix whose lower triangle consists of zeros.
During the forward elimination phase, which reduces matrix A into L (lower) and U (upper)
triangular matrices, each worker node works on a subset of the rows below the current
pivot row to reduce them to row echelon form. The code that does this transformation
is shown below:

for (i = 1 to N) {   // N is the size of the array
    max = abs(a[i,i]); pivot = i;
    for (j = (i+1) to N) {
        if abs(a[j,i]) > max {
            max = abs(a[j,i]);
            swap rows i and j;
        }
    }
    // at this point a[i,i] is the largest
    // element in column i

    parfor (p = 1 to maxworker) {   // ** parallel step **
        find start and end row from values of p and i;
        for j = start to end {
            use the value in a[i,i] and a[j,i]
            to set the element a[j,i] to zero
        }
    }
    // now all elements in column i from
    // row i+1 down are zero
}
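The row partitioning in the parallel step can be made concrete with a sequential Python sketch. It is an illustration under stated assumptions, not the program measured in the experiment: the helper names are hypothetical, the matrix is assumed to be nonsingular, the pivot row is found first and swapped once, each row is updated across all remaining columns, and the chunks that a parfor would hand to separate workers are processed one after another.

```python
def row_chunks(start, end, workers):
    """Split the half-open row range [start, end) into at most `workers` contiguous chunks."""
    total = end - start
    if total <= 0:
        return []
    size = -(-total // workers)  # ceiling division
    return [(s, min(s + size, end)) for s in range(start, end, size)]


def eliminate_rows(a, i, start, end):
    """Zero column i in rows [start, end) using pivot row i."""
    for j in range(start, end):
        factor = a[j][i] / a[i][i]
        for k in range(i, len(a)):
            a[j][k] -= factor * a[i][k]


def forward_eliminate(a, maxworker=3):
    """Reduce the square matrix `a` (list of lists) to upper-triangular form in place."""
    n = len(a)
    for i in range(n):
        # Partial pivoting: bring the largest entry of column i onto the diagonal.
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        # Stand-in for parfor: the chunks touch disjoint rows, so they are independent.
        for start, end in row_chunks(i + 1, n, maxworker):
            eliminate_rows(a, i, start, end)
    return a


matrix = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
forward_eliminate(matrix, maxworker=2)
```

Each chunk only reads the pivot row and writes its own rows, so no synchronization is needed inside one parallel step; the amount of work per chunk shrinks as i grows, which is what makes the threads fine-grained.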

As before, the code for the program is simple and readable, and the structure of the
parallelism is obvious. The program creates a set of maxworker threads on each iteration
through the matrix. Depending on the value of N, these threads do varying
amounts of work, but the maximum work each thread does is about N simple arithmetic
statements. Hence the program dynamically creates many threads in a sequential
fashion, and each thread is rather lightweight. It is therefore a test of Chime’s ability to
do fine-grain processing.

The above program was executed on three Pentium-133 machines with 64MB of
memory. As shown in the following table, execution on three machines under
Chime is between roughly 30% and 150% faster than on a single node. However, the gap
between single-node and three-node execution narrows as the matrix size increases. This may
be because the communication overhead outweighs the computational work
performed on the virtual nodes.



Input Data Size    Execution Time (seconds)    Percent speedup
                   1 node       3 nodes        on three nodes
100x100            46           19             142%
200x200            94           45             109%
300x300            156          89             75%
400x400            242          172            41%
500x500            349          272            28%



6. Conclusions 

Chime is a highly usable parallel processing platform for running computations on a
set of non-dedicated computers on a network. The programming method used in
Chime is the CC++ language, and hence Chime has all the desirable features of CC++. This
paper shows, through a set of experiments, how the Chime system can be used for a
variety of different types of computations over a set of diverse scientific applications.
Abstract. Many scientific applications manipulate large amounts of data and
are therefore parallelized on high-performance computing systems to take advantage
of their computational power and memory space. The size of the data processed
by these large-scale applications can easily overwhelm the disk capacity
of most systems. Thus, tertiary storage devices are used to store the data. The
parallelization of this type of application requires an understanding of not only the
data partition pattern among multiple processors but also the underlying storage
architecture and the data storage pattern. In this paper, we present a meta-data
management system that uses a database to record information about datasets
and manages these meta data to provide a suitable I/O interface. As a result, users
specify dataset names instead of physical data locations to access data using optimal
I/O calls, without knowing the underlying storage structure. We use an astrophysics
application to demonstrate that the management system can provide a convenient
programming environment with negligible database access overhead.



1 Introduction 

In many scientific domains, large volumes of data are often generated or accessed by
large-scale simulation programs. Current techniques for dealing with such I/O-intensive problems
use either high-performance parallel file systems or database management systems.
Parallel file systems have been built to exploit the parallel I/O capabilities provided by
modern architectures, and they achieve this goal by adopting smart I/O optimization techniques
such as prefetching [1], caching [?], and parallel I/O [?]. However, there are serious
obstacles preventing file systems from becoming a real solution to the high-level data
management problem. First of all, the user interfaces of file systems are low-level, which
forces users to express details of access attributes for each I/O operation. Secondly,
every file system comes with its own set of I/O interfaces, which makes ensuring program
portability a very difficult task. The third problem is that file system policies and
related optimizations are in general hard-coded and are tuned to work well for only a few
commonly occurring cases.

At the other end of the spectrum, database management systems provide a layer
on top of file systems that is portable, extensible, easy to use and maintain, and that
allows a clear and natural interaction with the applications by abstracting out the file
names and file offsets. However, their main goal is to be general purpose, and they cannot
provide high-performance data access. In addition, the data consistency and integrity
semantics provided by almost all database management systems put an added obstacle
Fig. 1. The meta-data management system environment contains three key components. All three
components can exist at the same site or can be distributed across sites. (Diagram labels: Visualization;
Database: meta data, access pattern, history; Storage: local and remote disks, tapes, SRB, HPSS
interfaces; exchanged items: current status and parameters, optimal I/O hints, access hints, data
associations, storage hints, file data.)



to high performance. Applications that process large amounts of read-only data suffer
unnecessarily as a result of these integrity constraints [?].

This paper presents preliminary results for our ongoing implementation of a meta-data
management system (MDMS) that manages the meta data associated with scientific
applications in order to provide optimal I/O performance. Our approach tries to combine
the advantages of file systems and databases and provides a user-friendly programming
environment that allows easy application development, code reuse, and portability;
at the same time, it extracts high performance from the underlying I/O architecture. It
achieves these goals by using a management system that interacts with the parallel
application in question as well as with the underlying hierarchical storage environment.

The remainder of this paper is organized as follows. In Section 2 we present the
system architecture. The details of the design and implementation are given in Section 3.
Section 4 presents preliminary performance numbers using an astrophysics application.
Section 5 concludes the paper.

2 System Architecture 

Traditionally, the work of parallelization must deal with both the data structures
used in the applications and the file storage configuration in the storage system. The
fact that these two types of information are usually obtained from off-line documents
increases the complexity and difficulty of application development. In this paper,
we present a meta-data management system (MDMS) that is designed as an active
middleware layer connecting users’ applications and storage systems. The management
system employs a database to store and manage all meta data associated with the applications’
datasets and the underlying storage devices. The programming environment of the data
management system architecture is depicted in Figure 1. Its three components (user
applications, the MDMS, and the hierarchical storage system) can exist
at the same site or can be fully distributed across distant sites.
MDMS provides a user-friendly programming environment that allows easy application
development, code reuse, and portability; at the same time, it extracts high
I/O performance from the underlying parallel I/O architecture by employing advanced
I/O optimization techniques such as data sieving and collective I/O. Since the meta-data
management system stores information describing both the application’s I/O activity and
the storage system, it can provide three kinds of transparency that ease programming: data
transparency through the use of dataset names rather than file names; resource transparency
through the use of information about the abstract storage devices; and access function
transparency through the automatic invocation of high-level I/O optimizations.



3 Design and Implementation 

The design of the MDMS has four aims: to determine and organize the meta data; to provide
a uniform programming interface for accessing data resources; to improve I/O performance
by manipulating files within the hierarchical storage devices; and to provide a graphical
user interface for querying the meta data.

3.1 Meta-data Management 

There are four levels of meta data considered in this work that can provide enough
information for designing better I/O strategies.

Application Level. Two types of meta data exist at this level. The first describes
users’ applications and contains the algorithms, structure of datasets, and compilation and
execution environments. The second type is historical meta data, for instance
time stamps, parameters, I/O activities, result summaries, and performance numbers. The
former is important for understanding the applications, and the management system
can use it to provide a browsing facility to help program development. The historical meta
data can be used to determine the optimal I/O operations for data access in future
runs.

Program Level. This level of meta data mainly describes the attributes of the datasets
used in the applications. The attributes of a dataset include its data type and structure. Since
similar datasets may potentially undergo the same operations and have the same access
pattern, dataset association provides an opportunity for performance improvement
in both computation and I/O. The meta data with respect to I/O activity at this level
includes file location, file name, I/O mode, and file structure.

Storage System Level. For a hierarchical storage system, the storage and file system
configurations are considered valuable meta data. In a distributed environment, since
the storage device may not be located at the same site as the machine that runs the application,
the meta data describing remote systems must be captured. The meta data at this level
mainly deals with file attributes across different physical devices and can be used to
make suitable I/O decisions by aggressively moving files within the storage system.

Performance Level. Besides the historical performance results, other valuable meta
data includes the I/O bandwidth of the hierarchical storage system, the bandwidth of the
remote connection, and the performance of the programming interfaces. The meta data that directly affects
the I/O performance of parallel applications is the dataset processor partition pattern and
Fig. 2. The representation of meta-data in the database. The relationship of tables is depicted by
the connected keys. Dataset objects with the same access pattern are associated together.
(Diagram labels include APPLICATION TABLE and RUN TABLE for astro3d.)



data storage pattern within the files. These two patterns can be used to decide between
collective and non-collective I/O. In this work, we build the MDMS on top of MPI-IO,
and the appropriate meta data is passed to MPI-IO as file hints for further I/O performance
improvement. For applications performing a sequence of file accesses, the historical
access trail is typically useful for access prediction.

We use a relational database to store the meta data and organize them into relational
tables. Figure 2 shows several tables of our current implementation. For each registered
application, five tables are created. The run table records the run-time parameters used for
each specific run. The attributes of the datasets used in the applications are stored in the dataset
and access pattern tables. Multiple datasets with the same structure and I/O behavior
in terms of size, type, and partition pattern are associated together. In this example,
two datasets, temperature and pressure, are associated with the same row in the
access pattern table. For the same I/O operations performed on the associated datasets,
some resources can be reused, e.g., the file view, the derived data type, or even the same
file. The execution table stores all I/O activities for each run. The storage pattern table
contains the file locations and the storage patterns.
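The five per-application tables and the idea of dataset association can be sketched with a small relational schema. The following Python example is an illustration only: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL, and every column name and sample value is a hypothetical choice made for this sketch rather than the schema of the MDMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE run (run_id INTEGER PRIMARY KEY, parameters TEXT);
CREATE TABLE access_pattern (
    pattern_id INTEGER PRIMARY KEY,
    dims INTEGER, sizes TEXT, elem_type TEXT, partition TEXT);
CREATE TABLE dataset (
    name TEXT PRIMARY KEY,
    pattern_id INTEGER REFERENCES access_pattern(pattern_id));
CREATE TABLE execution (
    run_id INTEGER REFERENCES run(run_id),
    dataset TEXT REFERENCES dataset(name),
    iteration INTEGER, io_mode TEXT);
CREATE TABLE storage_pattern (
    dataset TEXT REFERENCES dataset(name),
    file_location TEXT, layout TEXT);
""")

# One access pattern row shared by two datasets: this is the "association".
conn.execute(
    "INSERT INTO access_pattern VALUES (1, 3, '64x64x64', 'float', 'BLOCK,BLOCK,BLOCK')")
conn.executemany("INSERT INTO dataset VALUES (?, 1)",
                 [("temperature",), ("pressure",)])

associated = [row[0] for row in conn.execute(
    "SELECT name FROM dataset WHERE pattern_id = 1 ORDER BY name")]
```

Because both datasets point at the same access pattern row, anything derived from that row (for example a file view or a derived data type) needs to be computed only once for the whole group.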



3.2 Application Programming Interface 

The implementation of the meta-data management system uses a PostgreSQL database
[?] and its C programming interface to store and manage the collected meta data. User
applications communicate with the MDMS through its application programming interface
(API), which is built on top of the PostgreSQL C programming interface. Since all meta-data
queries to the database are carried out using standard SQL, the overall implementation
Table 1. Some of the MDMS application programming interfaces.

Function name        Argument list                          Description
initialization       (appName, argc, argNames,              Connect to database, register application,
                     argValues)                             record a new run with a new run id
create_association   (num, datasetNames, dims, sizes,       Store dataset attributes into database, create
                     pattern, numProcs, eType, handle)      dataset association, return a handle to be used
                                                            in following I/O functions
get_association      (appName, datasetName,                 Obtain the dataset metadata from the database,
                     numProcs, handle)                      the return handle will be used in the
                                                            following I/O calls
save_init            (handle, ghosts, status, extrainfo)    Determine file names, open files, decide optimal
                                                            MPI-IO calls, define file type, set proper file
                                                            view, calculate file offset
load_init            (handle, pattern, ghosts, status)      Find corresponding file names, open files, decide
                                                            optimal calls, set proper file view, set file
                                                            type, find file offsets
save                 (handle, datasetName, buffer,          Write datasets to files, update execution table
                     count, dataType, iterNum)              in database
load                 (handle, datasetName, buffer,          Read datasets from files
                     count, dataType, iterNum)
save_final           (handle)                               Close files
load_final           (handle)                               Close files
finalization         ()                                     Commit transaction, disconnect from database



of the MDMS is portable to all relational databases. Table 1 describes several APIs
developed in this work.

MDMS APIs can be categorized into two groups: meta-data query APIs and I/O operation
APIs. The first group, consisting of initialization, create_association, get_association,
and finalization, is used to retrieve, store, and update meta data in the database system.
Through these APIs, a user application can convey information about its expected
I/O activity to the MDMS and can request useful meta data from the MDMS, including
optimal I/O hints, data location, etc. Although a user’s application can use the retrieved
information to negotiate with the storage system directly, it may not be reasonable to
require users to understand the details of the storage system in order to perform suitable
I/O operations. Since the MDMS is designed to contain the necessary information describing the
storage system, its I/O operation APIs can act as an I/O broker; with the rich
meta data inside the system, these APIs can decide on appropriate I/O optimizations.

Figure 3(a) shows a typical I/O application using the MDMS APIs. By calling
create_association or get_association, user applications can store or retrieve meta data.
The functions save_init and load_init set up the proper file view according to the access patterns
stored in the handle. Then, a sequence of I/O operations can be performed on the same
group of associated datasets using save and load.
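The order of calls on the write path can be illustrated with a mock that does nothing except enforce the sequence. The Python class below is a hypothetical in-memory stand-in, not the MDMS library: it borrows the function names from Table 1, shortens their argument lists, performs no database or file access, and uses the tuple of dataset names as the "handle".

```python
class MockMDMS:
    """Hypothetical stand-in that only checks the order of API calls."""

    def __init__(self):
        self.state = "closed"
        self.log = []

    def _step(self, name, allowed, new_state):
        if self.state not in allowed:
            raise RuntimeError(f"{name} called in state {self.state!r}")
        self.state = new_state
        self.log.append(name)

    def initialization(self, app_name):
        self._step("initialization", {"closed"}, "connected")

    def create_association(self, dataset_names):
        self._step("create_association", {"connected"}, "associated")
        return tuple(dataset_names)            # the handle is just the name tuple here

    def save_init(self, handle):
        self._step("save_init", {"associated"}, "saving")

    def save(self, handle, dataset_name, buffer, iter_num):
        if dataset_name not in handle:
            raise KeyError(dataset_name)
        self._step("save", {"saving"}, "saving")

    def save_final(self, handle):
        self._step("save_final", {"saving"}, "associated")

    def finalization(self):
        self._step("finalization", {"connected", "associated"}, "closed")


mdms = MockMDMS()
mdms.initialization("astro3d")
handle = mdms.create_association(["temperature", "pressure"])
mdms.save_init(handle)                         # set up once for the whole group
for iteration in range(2):
    for name in handle:
        mdms.save(handle, name, buffer=b"", iter_num=iteration)
mdms.save_final(handle)
mdms.finalization()
```

The point of the sketch is the shape of the loop: the setup performed by save_init is shared by every save on the associated datasets, so it is paid once rather than once per dataset and iteration.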

3.3 I/O Strategies 

The design of the I/O strategies focuses on two levels of data access: I/O between memory
and disk, and I/O between disk and tape. The definition of data movement within a
hierarchical storage system is given in Figure 3(b).

Data Access Between Memory and Disk. For parallel applications, the I/O
cost is mainly determined by the partition pattern among processors and the file storage
pattern in the storage system. When the two patterns match, non-collective I/O
performs best. Otherwise, collective I/O should be used. The MDMS I/O interface
is built on top of MPI-IO [?]. The fact that MPI-IO provides its I/O calls for
Fig. 3. (a) A typical execution flow of applications using the meta-data management system's APIs
to perform I/O operations: initialization( ), then create_associate( ) or get_associate( ), then
save_initial( ) ... save_final( ) or load_initial( ) ... load_final( ), then finalization( ).
(b) Data movement in the hierarchical storage system.



different storage platforms lets the implementation of the MDMS I/O API focus on
I/O type determination. Other I/O strategies, including data caching in memory and
prefetching from disk, can also be used to reduce I/O costs.
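The I/O type determination can be illustrated for one common case: a block-partitioned array stored in row-major order. The Python sketch below is a simplified illustration, not the MDMS decision logic; the function names are hypothetical, and "matching" is taken to mean that each process's block occupies one contiguous region of the file, which is the case when only the first (slowest-varying) dimension is partitioned.

```python
def contiguous_in_row_major(procs_per_dim):
    """True if each process's block is one contiguous run of a row-major file.

    procs_per_dim: number of processes along each array dimension.
    Only the slowest-varying (first) dimension may be split.
    """
    return all(p == 1 for p in procs_per_dim[1:])


def choose_io_mode(procs_per_dim):
    """Pick non-collective I/O when partition and storage patterns match, else collective."""
    if contiguous_in_row_major(procs_per_dim):
        return "non-collective"
    return "collective"


slab_mode = choose_io_mode((16, 1, 1))   # 16 processes, array split along x only
block_mode = choose_io_mode((4, 2, 2))   # 16 processes, array split in every dimension
```

When every dimension is partitioned, each process owns many small, scattered file regions, and a collective call lets the I/O layer merge them into fewer, larger accesses.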

Data Access Between Disk and Tape. For access to a single large file, the sub-filing
strategy [?] has been designed, which divides the file into a number of small chunks
called sub-files. These sub-files are maintained by the system and are transparent to programmers. The
main advantage of doing so is that data requests for relatively small portions of the
global array can be satisfied without transferring the entire global array from tape to
disk. For accessing a large number of smaller files, we have investigated the
native SRB container technique and proposed a strategy of super-filing [?].
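The benefit of sub-filing comes from mapping a request to the few chunks it overlaps. The Python sketch below shows that mapping for a regular chunk grid; it is an illustration under stated assumptions rather than the sub-filing implementation, and the function name, the regular grid, and the half-open request box are all hypothetical choices.

```python
from itertools import product


def touched_subfiles(chunk_shape, lo, hi):
    """Return the grid coordinates of the sub-files that overlap the box [lo, hi).

    chunk_shape, lo, hi: one entry per array dimension; hi is exclusive and hi > lo.
    """
    ranges = [range(l // c, (h - 1) // c + 1)
              for c, l, h in zip(chunk_shape, lo, hi)]
    return list(product(*ranges))


# A global array stored as 4 x 4 chunks.
corner = touched_subfiles((4, 4), lo=(0, 5), hi=(3, 8))    # rows 0..2, columns 5..7
middle = touched_subfiles((4, 4), lo=(2, 2), hi=(6, 6))    # rows 2..5, columns 2..5
```

Only the listed sub-files would need to be staged from tape to disk, instead of the whole global array.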

3.4 Graphic User Interface 

To give users a convenient tool for understanding the meta data stored in the
MDMS, we have developed a graphical interface for interacting with the system [9].
The goal of this tool is to help users program their applications by examining
the current status of the underlying dataset configurations.

4 Experimental Results 

We use the three-dimensional astrophysics application astro3d [?], developed at the
University of Chicago, as the test program throughout the preliminary experiments. This
application employs six float-type datasets for data analysis and seven unsigned-character-type
datasets for data visualization. astro3d runs a simulation loop
in which, every six iterations, the contents of the six float-type datasets are written
to files for data analysis and checkpointing, and, every two iterations, two of the seven
unsigned-character-type datasets are dumped to files to represent the current
visualization status. Since all datasets are block partitioned in every dimension among
processors and have to be stored in files in row-major order, collective I/O is used. Let X, Y,
and Z represent the size of the datasets in dimensions x, y, and z, respectively, and let N be the
Table 2. The amount of data written by astro3d with respect to the parameters N (number of
iterations) and X, Y, and Z (data sizes in three dimensions).

                      X x Y x Z
N     No. I/O    64 x 64 x 64     128 x 128 x 128    256 x 256 x 256
6     33         20.97 Mbytes     167.77 Mbytes      1.34 Gbytes
12    56         35.13 Mbytes     281.02 Mbytes      2.25 Gbytes
24    102        63.44 Mbytes     507.51 Mbytes      4.06 Gbytes
36    148        91.75 Mbytes     734.00 Mbytes      5.87 Gbytes
48    194        120.06 Mbytes    960.50 Mbytes      7.68 Gbytes



number of iterations. Table[2|gives the amount of I/O performed with different values of 
parameters specified. The performance results were obtained on the IBM SP at Argonne 
National Laboratory (ANL) while the PostgreSQL database system is installed on a 
personal computer running Linux at Northwestern University. The parallel file system, 
PIOFS III 1 1 . on the SP is used to store the data written by astro3d. The experiments 
performed in this work employ 16 compute nodes. 
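
To make the access pattern concrete, the following sketch shows why a block partition in every dimension produces many small, noncontiguous file regions per process when the global array is stored in row-major order, which is the situation collective I/O is designed for. The 4 x 2 x 2 process grid, the function names, the 4-byte element size, and the assumption that z is the fastest-varying dimension are illustrative choices, not details taken from astro3d.

```python
from itertools import product


def block_bounds(global_size, nprocs, coord):
    """Return (start, length) of one process's block along one dimension.

    Assumes nprocs divides global_size evenly, which holds for the
    power-of-two sizes used in this sketch.
    """
    length = global_size // nprocs
    return coord * length, length


def contiguous_runs(shape, grid, coords, elem_size=4):
    """List the (file_offset, nbytes) runs owned by the process at `coords`.

    shape  -- global array size (X, Y, Z), stored in row-major order with
              z varying fastest (an assumption of this sketch)
    grid   -- number of processes along each dimension (hypothetical)
    coords -- position of this process in the process grid
    """
    X, Y, Z = shape
    (x0, nx), (y0, ny), (z0, nz) = (
        block_bounds(s, p, c) for s, p, c in zip(shape, grid, coords)
    )
    runs = []
    for x, y in product(range(x0, x0 + nx), range(y0, y0 + ny)):
        # Each (x, y) pair contributes one contiguous run of nz elements.
        offset = ((x * Y + y) * Z + z0) * elem_size
        runs.append((offset, nz * elem_size))
    return runs


if __name__ == "__main__":
    shape = (64, 64, 64)  # smallest data size in Table 2
    grid = (4, 2, 2)      # hypothetical layout of the 16 processes
    runs = contiguous_runs(shape, grid, coords=(1, 0, 1))
    print(len(runs), "runs of", runs[0][1], "bytes each")
```

With these assumed parameters, each process owns 512 separate runs of 128 bytes in the file for a single float dataset. Issuing them independently would mean many small requests, whereas a collective call lets the I/O library merge the requests of all processes into large contiguous accesses.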

Given different data sizes and iteration counts, we compare the performance of
the original astro3d and its implementation using MDMS APIs. The original astro3d has
already been implemented using optimal MPI I/O calls, that is, collective I/O calls.
Therefore, we do not expect any major difference between the two implementations.
However, MDMS is expected to outperform applications that do not optimize their
I/O. Figure 4 gives the overall execution time and the database
access time for two data sizes with five iteration counts. For the 256 x
256 x 256 data size, the total amount of I/O ranges from 1.34 to 7.68 Gbytes and the overall
execution time ranges from roughly 100 to 900 seconds. Since the connection between the IBM
SP and the database goes through the Internet, the database query times vary but are
all within 3 seconds. Compared with the relatively large I/O time, the overhead of
the database queries becomes negligible. Although using MDMS incurs the overhead
of negotiating with the database, dataset association saves the time of
setting file views and defining derived data types for buffers. For this particular application,
astro3d, the advantage of using MDMS over the original program can be seen in the
slight performance improvement shown in Figure 4.
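
As a rough check on the claim that the database overhead is negligible, the query time can be expressed as a fraction of the overall execution time. The sketch below uses only the bounds quoted above (query times within 3 seconds, execution times of roughly 100 to 900 seconds for the 256 x 256 x 256 case) and assumes, pessimistically, that every run pays the full 3 seconds. The function name is hypothetical.

```python
def overhead_fraction(query_seconds, total_seconds):
    """Fraction of the overall execution time spent on database queries."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return query_seconds / total_seconds


if __name__ == "__main__":
    # Upper bound on query time and approximate range of run times
    # quoted in the text for the 256 x 256 x 256 data size.
    for total in (100.0, 900.0):
        frac = overhead_fraction(3.0, total)
        print(f"{total:5.0f} s run: at most {frac:.2%} spent on database access")
```

Under these assumptions, database access accounts for at most about 3% of the shortest run and about 0.3% of the longest.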

5 Conclusions 

In this paper, we present a program development environment based on maintaining
performance-related system-level meta-data. This environment consists of the user's
applications, the meta-data management system (MDMS), and a hierarchical storage system. The
MDMS provides a data management and manipulation facility for use by large-scale
scientific applications. Preliminary results obtained using an astrophysics application
show that the overhead of database access is negligible compared with the same application
implemented with optimal I/O. Future work will extend the MDMS functionality
to hierarchical storage systems, including tape and remote file systems.

Fig. 4. Execution time of the astro3d application and the database access time using MDMS. The
timing results were obtained by running on 16 processors.